In past Head Start evaluations, cognitive measures 
have been weighed heavily. This has not accurately reflected the 
relative unimportance of cognitive program goals; child performance 
gains are not an objective with high priority for most Head Start 
programs. Evaluation planners need to weigh previously encountered 
measurement problems carefully and decide to adopt either a 
reliability-based strategy placing emphasis on careful test 
administration or a validity-based strategy assuming that what is 
needed is a fundamental reconceptualization of the measurement of 
cognitive effects, developing new measures. As priorities for 
cognitive measurement, this study argues that the new evaluation 
should stress readiness, cognitive process, and social competency. 
If a validity-based strategy is adopted, lists of clearly defined 
behavioral objectives must be drawn up in those realms, and 
instruments must then be created or adopted to measure these 
objectives. What is needed is a battery of face-valid, empirically 
based, criterion-referenced instruments intended to measure 
short-term effects. Choice of measures is integrally related to 
choice of evaluation design. The new evaluation might consider some 
departure from pre- and post-testing, instead testing three times 
during the year or only once at the end. (RC) 
PREFACE 



This report arises out of the Rand study to design an evaluation 
of social competence in Head Start children for the Office of Child 
Development (OCD), HEW. It was written as the keynote paper for one 
of four panel meetings of child development experts. The panels were 
convened by The Rand Corporation and OCD to identify candidate 
outcomes, measures, and research strategies, and difficulties with them, 
for a national evaluation of Head Start. 

The report was prepared for the cognitive effects panel, held in 
New York, October 17 and 18, 1973. It is intended as policy analysis 
to help OCD generate evaluation options. It weighs political 
considerations as well as those related to research design and asks what kind 
of evaluation would be most useful to a number of different audiences, 
from the Office of Management and Budget (OMB), to the Secretary of 
HEW, to the Congress, to parents of Head Start children, to the 
university-based research community. 

Mr. John Butler, author of this report, is editor of the Harvard 
Educational Review. 
PURPOSE OF COGNITIVE EFFECTS MEASUREMENT 

Decisionmakers judging the value of Project Head Start are likely 
to use four basic evaluation criteria: 

o Is the program well-implemented? 

o Does it have a political constituency? 

o Does it make sense as a way to support low- 
income parents and families and to provide 
them with child care? 

o Does it accomplish anything for children 
beyond custodial care? 

These criteria are probably ranked by many in descending order of 
importance, suggesting that cognitive effects, included under the fourth 
criterion, are by no means the only basis for policy decisions. In the 
future, however, funding decisions may be based more heavily than before 
on performance outcomes. The program's political support may not be as 
vocal or well-organized as it once was; now that there are numerous other 
programs for poor families competing for the same funds, the Congress, the 
Office of the Secretary of HEW, and the Office of Management and Budget 
may give the program new scrutiny on grounds of cost-effectiveness. 

The primary audience for a new evaluation comprises national 
legislators and agency personnel. It is not clear what must be demonstrated 
to convince these groups of Head Start success in the realm of cognitive 
effects. Five positions seem tenable: 

1. In a randomly selected group of Head Start programs, effects 
must be generalized and preferably must persist into the 
elementary grades; 

2. Some Head Start programs must [illegible in source] 
effects with participant children; 

3. There must be strong cognitive benefits for some subgroups of 
Head Start children; 



4. Cognitive effects need only be demonstrated as a moderator 
variable; cognitive goals are not the principal outcomes to 
be evaluated; 

5. We do not at present have the measurement technology to assess 
Head Start's cognitive effects. 

Many policymakers currently espouse the point of view that Head Start 
must show universal and lasting effects. This raises a fundamental 
difficulty: It can be predicted that no evaluation design looking only for 
generalized effects is apt to teach us anything not learned from the two 
previous national evaluations of the program (Westinghouse-Ohio: Cicirelli 
et al., 1969; Planned Variation Head Start (PVHS): Smith et al., 1973; 
Weisberg, 1973); in addition, no evaluation pursuing longitudinal effects 
with sufficient care is apt to be worth the money. If evaluators adopt 
position one, they are apt to learn little new about which programs are 
working and why. It is therefore important to shift the terms of the 
evaluation away from this position. 

Positions two and three have considerable appeal if the evaluation 
can take the form of a small-scale, well-controlled study of certain 
program prototypes, perhaps ranked according to cost of delivery. Such a 
study would be most effective, however, with true randomization of 
children to programs, clear operational definitions of program prototypes, 
and sufficient controls. None of these was evident in the Planned 
Variation Study, which proved at best a preliminary, hypothesis-generating 
venture. To some extent a careful study may involve trading full 
representativeness of the sample and a large battery of measures for increased 
depth of analysis on some smaller group of children or programs. 

Position four also has definite appeal, especially if the evaluation 
is to emphasize outcomes in the realms of health and nutrition, social 
development, or the effects of Head Start on the family. Position five 
is maintained by certain skeptics, but as a practical matter it must be 
rejected because some evaluation, however imperfect, is required. 

ROLE OF COGNITIVE EFFECTS MEASURES IN THE HEAD START BATTERY 

In past Head Start evaluations, cognitive measures have been weighed 
heavily. This has not accurately reflected the relative unimportance of 
cognitive program goals; child performance gains are not an objective with 
high priority for most Head Start programs. In general, past evaluations 
also have been plagued by three measurement problems: low quality of 
the field operation for test administration and other aspects of data 
collection; poor theoretical rationale for the individually administered 
cognitive tests in the Head Start battery; and equally poor theoretical 
rationale for observational and interview measures, with the additional 
problem of low reliability for such measures. 

Evaluation planners need to weigh these problems carefully and 
decide whether to adopt a reliability-based strategy in devising the new 
evaluation, or a validity-based one. A reliability-based strategy would 
accept as given a limited number of the best available instruments and 
place emphasis on careful test administration. New data need not be 
different in kind from past data, only of better quality. A 
validity-based strategy would make a different assumption: that we need a 
fundamental reconceptualization of the measurement of cognitive effects, 
developing new measures. This strategy would require more time to fulfill. 

Cognitive effects can be loosely divided into five realms: (1) 
norm-based kindergarten or first-grade readiness; (2) theory-based developmental 
shifts; (3) changes in cognitive process; (4) social competency and 
awareness; and (5) general knowledge. As priorities for cognitive measurement, 
this study argues that the new evaluation should stress readiness, 
cognitive process, and social competency. 

THE NEW COGNITIVE EFFECTS BATTERY 

It may be unwise to spend additional funds on the development of new 
instruments. There is ample evidence from laboratory-school studies as 
well as from the two national evaluations of Head Start (Westinghouse- 
Ohio and PVHS) and the Educational Testing Service (ETS) Longitudinal 
Study that good Head Start programs show consistent short-term effects on 
a variety of measures. A new set of instruments might show only the same 
pattern again. 

If evaluation planners do decide to adopt a validity-based strategy 
and devise new measures, they need to begin by making lists of clearly 
defined behavioral objectives in the realms of readiness, cognitive 
process, and social competency and then create or adopt instruments to 
measure these objectives. What is needed is a battery of face-valid, 
empirically based, criterion-referenced instruments intended to measure 
short-term effects. At present there is a paucity of good measures of 
this type. 

A related issue is the appropriate balance of individually admin- 
istered tests to observational measures or rating scales. In general, 
individually administered tests are more reliable but tend to measure 
too small a slice of the child's world. Other instruments are higher 
in risk but also higher in potential gain: They are more likely to be 
unreliable or of low validity, but if they successfully overcome these 
obstacles they stand to be more persuasive than other instruments. 

Choice of measures is integrally related to choice of evaluation 
design. Past evaluations have tried to investigate too much at once, 
throwing even the most elementary conclusions into doubt. One persistent 
problem, as an example, is that the same tests may not be appropriate for 
both four- and five-year-olds. 

The new evaluation also might consider some departure from pre- and 
post-testing, instead testing three times during the year or only once 
at the end. 






ACKNOWLEDGMENTS 



The author is indebted to various experts in child development, 
testing and measurement, and program evaluation for their ideas and 
wisdom about past experience. Among them are Joan Bissell, Jerome Kagan, 
Gerald Lesser, Richard Light, Frederick Mosteller, David Mundel, Vicki 
Shipman, Marshall Smith, Sheldon White, and Susan Woolsey. 




CONTENTS 

PREFACE 

SUMMARY 

ACKNOWLEDGMENTS 

Section 

I. INTRODUCTION 

II. PURPOSE OF THE COGNITIVE EFFECTS EVALUATION 

The Functions of Previous Evaluations 

Criteria of Cognitive Success 

Which Position Should be Adopted? 

III. THE ROLE OF COGNITIVE EFFECTS MEASURES 
IN THE HEAD START BATTERY 

Problems with the Past Measurement of Head Start 
Cognitive Effects 

What Is To Be Done? 

Likely Realms of Cognitive Effects 

Priorities for Measurement 

IV. THE NEW COGNITIVE EFFECTS BATTERY 

Individually Administered Tests 

Classroom Observation Instruments 

Home and Neighborhood Observation Instruments 

Parent, Teacher, and Sibling Interviews and Ratings 

Instruments for Collecting Incidental Facts 
About Reductions in Social Costs 

Balance Among Types of Instruments 

Measurement Strategy and Evaluation Design 

V. CONCLUSIONS 

BIBLIOGRAPHY 






I. INTRODUCTION 



Each of the three major sections of this report is a discrete unit, 
but each follows from the previous section in its logic and increasing 
level of specificity. Section II is a consideration of the purpose of 
the cognitive effects evaluation. Why are we looking at cognitive 
effects of Head Start at all? What should an evaluation of cognitive 
effects set out to demonstrate or explore? Some of the issues raised 
are generic to Head Start evaluation, applying as readily to other 
measurement domains as to the measurement of cognitive performance. 
Other issues surround the role of cognitive measures in particular, 
what they have meant in past Head Start evaluations, and what they 
should be designed to accomplish in the next. 

The remaining sections deal more directly with practical and 
technical questions of measurement. Section III discusses the role of 
cognitive effects measures within the Head Start battery. What has been done 
in previous Head Start evaluations to assess dimensions of cognitive 
performance, what are some of the problems in measurement, and what can be 
done to improve the test battery itself and the quality of the data 
generated in a new evaluation? Section IV builds on the conclusions 
of the previous section and asks what kinds of instruments should be 
included in the new cognitive effects battery. In the domain of cognitive 
competence, what are appropriate behavioral objectives? Categories 
of measures are listed, ranging from individually administered pre- and 
post-tests, to classroom observation instruments, to interviews and 
rating scales. "Best bets" are considered among established measures 
and promising new ones. The report concludes with a brief discussion 
of the relation between choice of cognitive effects instruments and 
choice of overall experimental design. 
II. PURPOSE OF THE COGNITIVE EFFECTS EVALUATION 



Most would agree that a program evaluation should be decision-related: 
designed to enable policymakers, researchers, parents, or others to make 
rational choices. Too often researchers have applied an analysis of 
variance model to the world without first asking why they were doing it and 
what kinds of information it is apt to generate. Who are the primary 
audiences for the evaluation? What are the minimal sufficient data that 
can tell us what we need to know? And within budget constraints, which 
evaluation strategy will yield the highest return in valid and useful 
information given the dollars it costs to implement? 

THE FUNCTIONS OF PREVIOUS EVALUATIONS 

In designing a new national evaluation of Project Head Start, it is 
helpful to begin by recalling the purposes of past evaluations and 
considering how their results have been used. The first national evaluation of 
Head Start (Westinghouse-Ohio; Cicirelli et al., 1969) was an impact study, 
intended to find out whether Head Start programs in the aggregate were 
having any effect. Although at that time the program was still too young 
for conclusive judgments about its success, questions of cost-effectiveness 
were in the minds of many: Was Head Start a wise expenditure of federal 
funds or could comparable sums of money better be spent on children in 
some other way? The Office of Economic Opportunity, then sponsor of the 
program, was to provide an initial estimate of the program's 
effectiveness as preliminary data for a rationally based, go no-go decision on 
Head Start for coming fiscal years. 

The Westinghouse-Ohio evaluation placed heavy emphasis on measures 
of children's cognitive performance. It tried to answer one basic 
question: Are pre- to post-changes in the performance of children in a 
randomly selected group of Head Start programs higher than those experienced 
by comparable children without any program? The design looked for effects 
generalized across the entire Head Start population, regardless of 
particular center, location, or child subgroup. Head Start children were 
compared only with children without any preschool experience. 

The methodological pros and cons of the study are amply discussed 
in an exchange between the principal investigator, Victor Cicirelli, 
and Smith and Bissell in a 1969 issue of the Harvard Educational Review. 
The actual use of the evaluation by policymakers, however, has never been 
formally analyzed. Several hypotheses can be ventured, based on 
conventional wisdom about the influence of the report. First, its principal 
finding (only very slight effects across programs on the most reliable 
measures, not enough to impress anyone with Head Start outcomes) 
probably served to dampen the enthusiasm of many liberals and policy 
researchers about the prospects for an early childhood "cognitive 
inoculation" against the ravages of poverty. There was no apparent 
quantum jump in the cognitive competence of Head Start children compared 
with other children. Also, and more disturbingly, the evaluation 
probably reconfirmed the belief among many conservatives that Head Start 
efforts were a fool's errand. Results could be interpreted in support 
of the view that environmentalists had been too sanguine about the 
malleability of early intelligence and cognitive performance. 

Second and equally important, however, was a political groundswell 
supporting Head Start and believing that the terms on which it had been 
evaluated did not accurately reflect the goals envisioned by its 
architects or community participants. In some cases, this opposition took 
the form of scholarly rebuttal of the Westinghouse-Ohio Report, but 
scholarly response was probably less important than the fact of a 
continued, strong political constituency for the program in the field 
(Head Start parents, teachers, and other supporters) who believed that 
Head Start did make a difference, that it was worth the money, and that 
it would be a mistake to end the program. In general, liberal support 
for Head Start at various levels sustained itself despite lukewarm 
evaluation results. Although the program was closely scrutinized from then 
on, no decision to curtail the program was made on the basis of the findings. 

The recent Planned Variation Head Start Evaluation (PVHS 1969-1971; 
see Stanford Research Institute, 1971; Smith et al., 1973; Weisberg, 1973) 
was both more sophisticated in research design and more astute in its 
anticipation of political implications. It did not directly address 
the issue of generalized effects, the go no-go question; instead 
it asked another question: Among various Head Start prototype programs 
developed at laboratory schools, which ones were most effective when 
replicated in a field situation? Also, what were the differential 
effects of the various programs when compared with each other and with 
traditional, non-sponsored Head Start programs? The evaluation asked 
not whether Head Start was succeeding on the whole, but rather which 
Head Start programs were succeeding or achieving unique results. The 
Office of Child Development (OCD), now sponsor of the program, had in 
mind an incrementalist strategy: Discover which programs are most 
successful and then build on them for the future (see Light and Smith, 
1970). Findings were intended to inform two kinds of decisions, those 
by the agency itself about which programs to support most heavily in 
the future and those by parents and communities about what kind of 
prototype curriculum best suited their needs. All children in the 
study were attending either sponsored Head Start programs, based on a 
prototype, or traditional programs. Aggregate comparisons of Head Start 
and non-Head Start programs could be made only by pooling all of the 
data and using prescores of older children in the sample to simulate 
a non-Head Start control group. 

PVHS was one of the most ambitious natural experiments yet attempted 
in education at any level, and its implications have yet to be fully 
sorted out. But it is clear that many of the problems of the Westinghouse- 
Ohio evaluation recurred in the attempt to extract policy implications 
from the results, and some new problems arose. Except in the case of a 
few sponsored programs, effects in PVHS as assessed by traditional 
measures of cognitive performance continue to be slight. Proponents of 
some programs continue to say that evaluation instruments did not 
measure what their programs were setting out to accomplish. Detractors 
continue to say that most Head Start programs do not have sizable effects 
and are not worth the money. Differential effects, the main area of 
exploration, apparently have not as yet been the basis of any policy 
decisions by OCD about which prototype programs to support for the 
future or any decisions by community groups about which program 
configurations are apt to best serve their needs. (For an ample 
discussion of problems in making policy inferences from studies of 
differential effects, see Stodolsky, 1972.) 

The only other national study of Project Head Start, the ETS 
Longitudinal Study (see Shipman, 1973), still has not been fully 
analyzed. Its intent is less explicitly policy-related than either 
the Westinghouse-Ohio or PVHS evaluations, and its architects do not 
expect fundamental policy decisions about the future of the program 
to be based on their findings. 

There is much still to be learned about the relation between 
evaluation results and program-related policy decisions (see, for 
instance, Cohen, 1973). But in the case of Project Head Start this 
much is clear: Budget decisions from year to year have reflected little 
of the direct influence of evaluation results. Inflation and extension 
of program services to new realms have necessitated many program cutbacks, 
but as yet there has been no dramatic dismissal of the program by the 
Congress, the public, the Office of the Secretary of HEW, or the Office 
of Management and Budget. Head Start's budget has risen since 1965 
despite evaluation outcomes. 

What should this past experience tell us about the measurement of 
cognitive performance in a new national evaluation? First, it should 
make us reexamine the significance of cognitive effects as they relate 
to decisions about overall funding. In general, evaluation results are 
only one of many indicators in a complex political equation determining 
whether the program is sustained, curtailed, or subsumed under another 
program. Decisionmakers are sensitive to program popularity as well as 
to program effects, and four basic criteria are likely to influence their 
opinion of Head Start's value: 



1. Is the program well-implemented? 

As an input consideration, do Head Start centers look in the field 
as they should according to written descriptions? Is there an efficient 
delivery system and management structure? Are the program's various 
components functioning well? 






2. Does the program have a political constituency? 






Are there sufficient numbers of parents, community members, and 
agency employees who like the program and what it tries to accomplish? 
Are these people powerful enough in their numbers and lobbying finesse 
to push for budget increases? Prevent budget cuts? 

3. Does the program make sense as a way to support low-income 
parents and families? 

Is Head Start the best mode of child care delivery? How does Head 
Start articulate with other federal programs for the poor, such as AFDC, 
child care under Title IVa of the Social Security Act, and Medicaid? 
Should it compete with them for funds? 



4. Does the program accomplish anything for children beyond 
giving them basic custodial care? 

Are there measurable developmental or educational benefits of 
this program not experienced by non-Head Start children or children in 
custodial care programs? 

Policymakers probably rank this rough and ready set of criteria 
in descending order of importance. If so, it is not surprising that 
past evaluation data on child performance in Head Start has been used 
selectively, often to rationalize decisions made for other reasons or 
to bolster a preconceived view of the program's value. 

A second conclusion from past experience, as a corollary to the 
first, is that in general political support for Head Start is not 
dependent on evaluation results and, conversely, probably must be 
sustained independent of such results. OCD should not assume it can defend 
Head Start by evaluating it. If the program needs more friends in 
influential places, OCD should consider creating advisory panels, 
talking again to congressmen, and establishing a broader base for the 
coalition supporting the program, including businessmen and others not 
usually included. There may even be a need for a new and full-scale 
public relations effort simply to remind the nation that children and 
parents are enthusiastic about the program, that centers are clean and 
well-organized, that teaching is the best available, and that Head 
Start offers numerous indirect community benefits. 

To say that evaluation results will not relate simply and directly 
to funding decisions, however, is not to say that such results are 
inconsequential. Strong positive or negative findings, in particular, 
might be weighed more heavily now than in the past. At a time of 
federal budget cutbacks the government is more serious than it once was 
about criteria of cost-effectiveness. This is especially true if 
liberal supporters who backed the program in the 1960s and early 1970s, 
despite ambiguous evaluation results, are not so enthusiastic as they 
once were. 

If the primary audience for a new evaluation is the decisionmakers 
who determine future funding of the program, a third conclusion can be 
drawn: We still do not have any clear decision rule in the domain of 
cognitive effects for what would constitute program success. In the 
past, results have been a Rorschach blot of sorts, open to varying post 
hoc interpretations. There has been no operational definition to tell 
us when Head Start is "working" or "not working." This problem is 
especially thorny in that there are numerous reasonable, competing 
conceptions of success. The next section discusses the issue of success 
criteria in some depth. 

CRITERIA OF COGNITIVE SUCCESS 

Five basic positions have been taken regarding a sufficient demon- 
stration of Head Start's cognitive effects. Each leads to a rather 
different evaluation design and a different role for cognitive measures 
within that design. The positions are presented here in order of de- 
scending stringency: 

1. In a randomly selected group of Head Start 
programs, there must be demonstrable short-term cognitive 
effects for participant children that are not enjoyed by non- 
Head Start children. These effects must be generalized 
across centers and preferably should last into the 
elementary school grades. 

The logic of "go no-go," which dictated the Westinghouse-Ohio design in 
1967-68, is reflected in Position 1. Many policymakers wish to establish 
whether Head Start programs in the aggregate have effects on participant 
children, that is, whether there are generalized Head Start effects not enjoyed 
by children outside the program. In the Westinghouse-Ohio Study, 
Cicirelli and his colleagues were responding to this question, 
anticipating that overall conclusions were more important than fine-grained 
analysis of whether Head Start worked better for some children than 
for others. 

Arguably any positive evaluation must show effects for the aggregate 
Head Start population. It may not be enough to select a group of the 
best Head Start centers and compare them with each other and with 
traditional centers, as was done in PVHS, or to compare exemplary centers 
with non-Head Start controls. If effects generalized across all centers 
are the fairest estimator of what the government is getting for its 
investment, then the evaluation design must involve a random sampling 
procedure or at least a representative stratified sampling procedure in- 
cluding centers of all types. 

A second aspect of evaluation, which did not play a role in the 
design of the Westinghouse-Ohio evaluation but has occupied researchers 
based at lab schools and those doing follow-up studies for the past 
several years, is trying to measure whether effects last over time. 
This has also been a question of great interest to policymakers, some 
of whom believe it is necessary to demonstrate lasting effects in 
order to justify continuation of the program. On this view, program success 
can be demonstrated only by showing such effects and would be conclusive if 
effects for all Head Start children, or for a significant proportion 
of the children, were maintained well into elementary school. 

Although Position 1 is the dominant view of many, I will argue 
that it would be a serious mistake to design a new evaluation that 
assumes these are the most valuable criteria of success and failure. 
Let us pursue further the logic of designing an evaluation to demonstrate 
generalized effects, then lasting effects, to see the pitfalls that 
await if we adopt this position. 

First, it may not be feasible or desirable within the OCD evalu- 
ation budget to administer a full battery of tests to a nationally 
representative group of children. The major issue here is the cost 
of administering various kinds of tests at acceptable levels of reli- 
ability. Curiously this is not something that has ever received 
systematic study in the context of Head Start evaluation. It is clearly 
better to do well an evaluation of modest proportions that can inspire 
the faith of policymakers, professional researchers, and community 
participants because it is well-executed and its results are reliable, 
than to do something overblown and unconvincing. In the trade between 
quality and scope, reliability of measurement has to be emphasized in 
the first instance, even if it means considerably reducing the number 
of Head Start centers or children in the study. 1 This in itself, given 
budget constraints, may rule out a new national impact study. 

Another problem of any generalized effects study is that all kinds 
of programs must be represented, or at least have an equal chance of 
being represented, and variations now abound. With the advent of the 
Improvement and Innovation program, involving substantially different 
time and place options within Head Start, it is not clear that there 
is any longer much reason to consider Head Start a single program. 
Perhaps there never was. From the standpoint of Position 1 the level 
of analysis that most interests us is the highest one, pooling every 

1 Along with other information on test standardization it would be 
interesting to know (1) how much it costs to train and pay those 
administering a given test to do so at an acceptable level of reliability 
in the field, (2) how much it costs to mount a field operation of size 
x that would yield acceptable data on the test, (3) the tradeoff between 
reductions in price of administering a test and the resultant marginal 
reduction in reliability, and (4) the trades between different kinds 
of tests (e.g., individually administered pre- and post-tests as 
against classroom observation instruments) in cost and reliability of 
the data collected. Cost per test for any kind of measure in the 
battery could be plotted as a function of testing procedure, length of 
test, training necessary for its administration, acceptable level of 
reliability, and other variables. This kind of function would enable 
rational decisions in an arena where to date decisions have been made 
impressionistically. 
program and every child. This not only assumes a homogeneity of 
offerings, which may not be accurate, but also that effective programs are 
best considered side by side with ineffective ones: that the grand mean 
is more important than means for particular kinds of programs and groups 
of children. Neither assumption seems wise. 

Following the logic of Position 1, we can estimate what in Bayesian 
statistics is called a "prior," a preliminary guess about the likely 
magnitude of effects in the evaluation. The effects of interest would 
be differences between test gains for Head Start children and test 
gains for non-Head Start children over the Head Start year, as in the 
Westinghouse-Ohio evaluation, or simple differences at post-test on 
criterion-referenced measures if at the outset there were true random 
assignment of children to treatment groups. The magnitude of the dif- 
ference between gains of Head Start and non-Head Start children on 
many cognitive tests probably can be estimated with reasonable accuracy 
simply by looking at past evaluations that compare children in tradi- 
tional or non-sponsored programs with children not attending Head Start 
at all. Accumulating evidence from a variety of studies (Light and 
Smith, 1971), the evaluation designers could establish overlapping 
distributions as in Figure 1, one for gains of children who experienced 
Head Start and the other for those who did not. 

Fig. 1 - Distribution for children who did 
and did not experience Head Start. 

The difference between the means of the two distributions, with an 
appropriate confidence interval around it, would be the prior 
expectation of Head Start effects. 
Knowing the laws of sampling error and the prior estimate of effects, 
the evaluation designers can next establish sample size, if need be 
selecting a large enough sample to give a reasonable assurance of 
significant results. 

All of this sounds reasonable and rational. But it leads to pro- 
blems when applied to a design based on Position 1. First, a prior 
estimate of differences between the aggregate Head Start and non-Head 
Start groups on most reliable, individually administered tests might 
turn out to be as small as a quarter of a standard deviation. This 
suggests immediately that if a sample is going to be representative, 
avoiding obvious problems of non-sampling error, it is also going to 
have to be very large. We also must ask whether it is enough to show 
psychometrically reliable differences between groups, or whether dif- 
ferences of such a size, however well-measured, will still be seen as 
trivial by policymakers and others. There is no good evidence about 
how large a gain has to be before it is taken seriously, but a quarter 
of a standard deviation, whether or not it is statistically significant, 
probably is not enough to excite anyone greatly about Head Start's 
short-term cognitive effects. Carl Bereiter (personal communication, 
July 1973), for instance, once asked his introductory psychology class 
how large a Stanford-Binet IQ gain would have to be before they were 
impressed that it "made a difference." He did not attempt to define 
what this phrase meant. Almost everyone gave an answer in the vicinity 
of eight points — or half a standard deviation. 
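
The sample-size reasoning in the preceding paragraphs can be made concrete with a rough calculation. The sketch below is an illustration only and is not part of the original analysis. It uses the standard normal-approximation formula for comparing two group means. The significance level (0.05, two-sided) and power (0.80) are assumed conventions rather than figures from this report, and the function name is hypothetical. It also assumes simple random sampling of individual children; a sample clustered by Head Start center would generally need to be larger.

```python
import math
from statistics import NormalDist


def n_per_group(effect_sd, alpha=0.05, power=0.80):
    """Approximate number of children needed in EACH group to detect a
    difference of `effect_sd` standard deviations between two group means
    (two-sided test, normal approximation, equal group sizes).

    alpha and power are assumed conventions, not values from the report.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_sd ** 2)


# Implied by the text: eight Stanford-Binet points is half a standard deviation.
STANFORD_BINET_SD = 16

for effect_sd in (0.25, 0.50):
    iq_points = effect_sd * STANFORD_BINET_SD
    print(f"{effect_sd:.2f} SD ({iq_points:.0f} IQ points): "
          f"{n_per_group(effect_sd)} children per group")
```

Under these assumptions, halving the expected effect from one-half to one-quarter of a standard deviation roughly quadruples the number of children required in each group.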

The question of what is a sufficient gain to interest policymakers, 
apart from statistical significance, is not unimportant. Planners may 
be forced to make estimates of the face validity of change scores. If 
no reasonable estimate of Head Start effects that could be anticipated 
from a design matching Head Start children with non-Head Start children 
would yield very sizable overall effects, even in the short term, then 
this is an important reason to question the logic of Position 1. Once 
this position has been adopted there may be no way to impress policy- 
makers favorably, because a demonstration of statistically significant 
differences, which no doubt could be engineered for a price, simply 
would not yield sizable enough results. In addition, of course, such 
an evaluation strategy would not in the first instance tell us how 
various Head Start programs differ in their effects, an issue that 
PVHS was intended to explore and about which there are a number of 
interesting hypotheses. 

There is also a problem in trying to show that effects last over 
time. We know that for non-sponsored programs (traditional programs 
not based at any lab school or involving any specially developed curri- 
culum), short-term effects of preschool tend to wash out soon. We also 
know that even for the best lab school program, effects are difficult 
to sustain beyond the first three grades of school (see, for instance, 
Stearns, 1971, and S. White et al., 1973). 

There are two camps regarding the further exploration of longitud- 
inal effects. The first group maintains that such effects could be 
demonstrated, or at least explored, if we were willing to invest the 
money to mount a research effort sophisticated enough to find them. 
Such a design has been weakly approximated in the national evaluation 
of Project Follow Through, but because of cohort attrition, non-compar- 
able treatments, non-comparable child populations receiving different 
treatments, and other design problems, there has been no pretense that 
this is an adequate study to tell us what we would like to know. The 
only longitudinal studies to date that have approached sufficient 
methodological rigor have been those tracing small, lab preschool 
groups into elementary school. Even these studies have often been 
suspect, with inadequate controls and blinds in follow-up assessment 
procedures. For those who advocate exploration of longitudinal effects 
the issue is not whether such studies are feasible and valuable, but 
whether as a practical matter we are willing to spend the money and 
perhaps exert the necessary persuasion and social control to follow 
well-matched groups of children through the elementary school years, 
despite the difficulties created by high rates of geographic mobility, 
the need to orchestrate treatments in the elementary schools, and the 
enormous sample size that would be required for valid inference about 
effects, which are apt to be marginal. 



A second group does not believe the enterprise of exploring 
longitudinal effects is at all valuable. Along with certain experts 
in research design and methods, this group includes economists con- 
cerned about the misuse of cost-accounting procedures and a number of 
educational sociologists pursuing the logic of the findings contained 
in Inequality, by Christopher Jencks and his colleagues at Harvard's 
Center for Educational Policy Research (1972). The principal argument 
put forth is that although such a long-term research effort may be 
feasible, any effects would be small indeed, explaining only a negli- 
gible portion of the variance in subsequent school grades or other 
later outcomes of interest. No effect of any curricular intervention, 
however well organized, can be expected to last five years; and to 
formalize this criterion as a measure of Head Start program success 
is absurd. The cost-benefit economists in opposition add that any 
intervention probably will explain so little of the variance in subse- 
quent school performance that even if it could be demonstrated that 
[illegible] Head Start had a greater effect or more explan- 
atory power, no policymaker or cost-accounting expert would be im- 
pressed enough with the marginal differences to act on the basis of it. 
If Head Start is to be viewed as an investment, then the real question 
is one of "best bets" about maximum social return for each dollar 
invested in the child; unfortunately at present we have no systematic 
way of comparing the relative benefits of programs for older and 
younger children. 

The sociologists make a related point. If we look at the Jencks 
et al. work, no school-related input in the lives of young children at 
any socioeconomic level currently seems to have much effect on sixth 
and twelfth grade achievement. Moreover, as a second missing link in 
the chain, these later achievement scores do not seem to predict 
strongly to various adult success criteria of interest, notably adult 
income. School effects in general do not seem to have much to do with 
social mobility or with aspects of adult success that really matter. 
If nothing related to schooling at any level predicts strongly to 
important outcomes later, why should we expect this will be any dif- 
ferent for Head Start, and why should we make lasting effects a criterion 
of program success? We do not discontinue schooling on these grounds. 

In weighing the merits of arguments for and against a longitudinal 
effects evaluation, we must give the first camp its due. No one has 
ever attempted a careful follow-up design in the field, and such a 
study no doubt could be performed if money were forthcoming. In 
addition, as a political fact, many policymakers for better or worse 
have become wedded to the notion that Head Start must demonstrate 
lasting effects in order to justify itself. We as researchers have 
trained policymakers to think in such terms, and now we may find it 
difficult to reverse this line of reasoning. 

But it clearly makes most sense to side with the second group, re- 
jecting the predictive validity of gains as a success criterion. First, 
to do a longitudinal effects study in the field would be too expensive 
to be worth the money. In itself it would not withstand cost-benefit 
analysis: Results would probably be meagre and their policy impli- 
cations unclear. Second, to accept this success criterion is to im- 
pose an unfair burden on the program. In other federal program evalu- 
ations it is almost always sufficient to demonstrate success in the 
short term only. Thus, for instance, Medicaid expenditures are not 
[illegible] how they affect the life expectancy 
of the patient or how they reduce the probability of his returning to 
the hospital with some new problem four years hence. The aggregate 
effects of the Medicaid program for the price can be compared to the 
effects of previous programs, but since there is little similarity 
between a program like Head Start and previous schemes for the poor, 
such an approach is not very useful. 

There is a need to shift the terms of the debate about account- 
ability, proposing reasonable competing conceptions of accountability 
and reasonable ways of justifying Head Start's existence without re- 
quiring that it demonstrate longitudinally stable effects. To the 
extent we equate "giving OMB and the Secretary of HEW's office what 
they want" with mounting an evaluation based on Position 1, we have 
made a serious mistake. 

A third design aspect, which escaped consideration in the Westing- 
house study and subsequent Head Start evaluations but is wholly in 
keeping with the logic of cost-effectiveness and go no-go, is trying to 
assess whether Head Start is a better investment than some other program 
for the same target population. It can be argued that decisions about 
the future of Head Start are more dependent on assessment of its merit 
in comparison with federal expenditures for other kinds of children's 
programs than on assessment of program effects in themselves. No 
evaluation of Head Start has attempted to compare its costs and effects 
with those of another program, such as Sesame Street. But there has 
been impetus from various sectors of the agency structure for just such 
comparisons. Certainly this was one of the reasons for OCD's recent 
report on compensatory programs (S. White et al., 1973). 

It is hard to think how this third design feature could be built 
into a Head Start evaluation without being artificial or raising a hue 
and cry from many who would rightfully insist that Head Start was once 
again being made to cater to a procrustean set of evaluation criteria 
having little to do with its intended aims. Thus, for instance, if it 
were suggested that the Sesame Street test battery be applied to Head 
Start and that comparisons of children's performance in both programs 
(letter and number recognition, understanding of relational terms, 
[illegible]) be made using a single set of tests, 
no one at Head Start would be pleased, and even the most [illegible] ad- 
vocates of cost-effectiveness probably would agree that such an approach 
is [illegible]. 

o Position 2. It is sufficient to demonstrate that 
Head Start programs achieve [illegible] effects 
[illegible]. 

The most often-mentioned alternative to a simple comparison of Head 
Start and non-Head Start children is a differential effects study 
comparing various prototype programs. This is what PVHS was, and 
many feel that it remains the most promising approach for assessing 
Head Start effects. Any study that looks in the first instance at 
[illegible] 
would expect such effects, with some being more successful 
than others, it makes little sense to focus primarily on the aggregated 
programs. 

A differential effects study focuses on interactions of program 
type and other independent variables. The strategy was first dis- 
cussed in the aftermath of the Westinghouse-Ohio study (Light and Smith, 
1970; Smith and Bissell, 1970). Theoretically it would allow incremen- 
tal selection of good programs, an approach appealing to policymakers. 
If some programs seem to be working and others do not, then it makes 
sense to fund the ones that are. In addition, predictable inter- 
actions of program type and child population, if any, have possible 
utility implications. Such a study may offer communities information 
for coming to a decision about which kinds of programs they would prefer. 
If they know something about the predictable effects of a given pro- 
gram type, they may be able to make better choices about what seems 
best for their situation and their children. 

Some programs also may turn out to be more robust than others in 
their effects regardless of child population, and policymakers may pre- 
fer to back such programs. For purposes of aggregated comparisons in 
such a study the data subsequently may be pooled, enabling judgments 
similar to those made in general effects evaluation. In the main, 
however, the question of "whether Head Start works" is finessed; any 
conclusion says that it works in some programs under some conditions 
and not in others. 

These are clear advantages, but such a design also has several 
potential disadvantages. The first arises from lack of clarity about 
what a coherent educational "treatment" looks like. This is a lesson 
we have learned from PVHS. In general, there are two ways to sort pro- 
grams into a typology. One is to begin empirically, first going to the 
field, observing the full range of natural variations among programs, 
and then constructing a matrix of dimensions on which programs 
differ. These dimensions become the treatment variables or one set 
of independent variables in a subsequent study. This approach makes 
it very difficult to evolve an agreed-upon list of program difference 
dimensions with adequate face-validity or adequate salience to explain 
much of the variance in effects. We are forced to choose among 
various partially face-valid grouping schemes, focusing on widely dif- 
fering aspects of classroom process, teaching style, instructional 
materials, and teacher-child interaction. Some sense of how hard it is 
to derive "natural variations" of classroom process can be gained by 
reading Jackson's Life in Classrooms (1968). No equivalent work is 
available for preschools. 

If we forgo the empirical approach, then the alternative is to turn, 
as PVHS did, to planned variations— "treatments" in the form of theo- 
retically different programs, each representing the best efforts of an 
individual or research team at a university-based laboratory school. 
Template programs are generalized to the field situation. In PVHS, 
a number of promising programs were selected, including among them 
prototypes with widely differing aims and teaching strategies. These 
programs could be grouped according to their differing philosophies re- 
garding teaching materials and techniques, degree of teacher-initiated 
activity, hours spent in didactic exercises as against free play, and 
so forth. In PVHS this approach generally supported a weak dimension- 
alization, with certain gross and face-valid differences between pro- 
grams on a dimension called "structure." But regrettably it did not 
support much more. Many of the purported differences among PVHS 
sponsors were not readily apparent when programs were visited in the 
field. Even within programs of a single sponsor there often was wide 
variation in different sites, so that a Bankstreet College program in 
site X might look more like an Educational Development Center program 
in site Y than it did like another Bankstreet program in site Z. 

This confusion has given rise to an entirely new field of inquiry, 
as is often the case when there are unanticipated complications in an 
evaluation and when the research community senses the logic of the sit- 
uation. The new field is called "implementation research"; the object 
is to determine how well the sponsor template (the original program 
configuration created in the lab school setting) is replicated in the 
field. It has been discovered, and is still being discovered, that 
in this new area of research all the problems of an empirical 
dimensionalization reassert themselves at one remove. Criteria are 
needed to decide how well a program in the field matches its template 
program and other second generation programs in other sites. This means 
that the goals of each sponsor must be operationalized and we are once 
again in difficulty. 

Any evaluation scheme studying the differential effects of natural 
or planned variations is committed to looking at such effects in the 
context of a weak dimensionalization. Even the staunchest proponents 
of Position 2 are humble about the problem of categorizing programs 
as coherent treatments and understanding how they differ. Accepting 
this limitation, it may nonetheless be valuable to group programs 
along the kinds of dimensions proposed by Bissell (1970) and Mayer 
(1971), or those used in the PVHS evaluation (Featherstone, 1972; 
Smith et al., 1973). It might even be enough to separate programs 
on only one or two face-valid dimensions, perhaps the ones with greatest 
consequences for program cost or the ones with most promising con- 
sequences for a theory of pedagogy. 

By comparison, Sesame Street is in an enviable position. It is 
a coherent treatment that does not differ from site to site and suffers 
minimally from "noise" as it is disseminated. It is also reasonably 
modest about what it purports to teach. For those who have never con- 
sidered problems of program dimensionalization and implementation, it 
is instructive to think of Sesame Street as an analogue, or better an 
opposite, to a Head Start treatment. 

There is another major problem with a differential effects study. 
Perhaps all programs cannot be evaluated with the same instruments. 
The logic of non-comparable treatments can lead quickly to the position 
that non-comparable outcome measures are required. This problem is 
especially evident in planned variation studies and laboratory-school 
studies comparing more and less "structured" programs. To make matters 
worse, there simply are no trusted measures in many domains, especially 
those of affective development and self-concept, that might enable 
assessment of the goals certain sponsors say they are trying to achieve 
(Walker, 1973). The choice among current instruments is a harsh one: 
Either we must include measures with extremely low reliability and 
validity in the battery or we must exclude them and risk a biased 
evaluation. This problem can never be fully resolved until there are 
equally valid and reliable measures for all relevant domains of Head 
Start process and outcome. One compromise solution in the meantime 
might be to have a basic battery of tests on which all programs are 
compared, and then allow each program to select one or more additional 
measures that it alone will use, or that all programs will have to use 
at its request. 

A final problem of any differential effects study, pointed out by 
Stodolsky (1972), is that if differences in program effects are found, 
their policy implications often are unclear. If we discover, for 
instance, that the Weikart Hi/Scope program results in large gains on 
measures of general intelligence but that some other program results 
in happier parents, how does this readily translate itself into educa- 
tional policy? Certainly such findings are useful information for 
community groups choosing a new curriculum, but they do not in any 
obvious way inform agency decisions about which programs to support 
in the future and which to terminate. The government does not get the 
kind of information that would enable it to distill the best configur- 
ations from the initial group of prototypes by successive approximations, 
and the incrementalist strategy envisioned by Light and Smith (1970) 
is not readily fulfilled. 

Despite these complications and disadvantages, Position 2 may be 
the most reasonable to espouse in a new Head Start evaluation. It 
remains attractive for three reasons. First, more from a political 
than a measurement standpoint, a differential effects design justifies 
looking closely at a subset of the best or most clearly defined pro- 
grams. If we are interested in differential effects, we must be as 
clear as possible about the treatments compared. This probably leads 
to selecting program types that have had some identifiable dimension- 
ality and some measurable effects in the past. We have just finished 
investing ten million dollars over five years in PVHS to learn some- 
thing about various sponsored Head Start programs and their effects 
on different groups of children. From that study facts were gathered 
about which programs were most readily implemented, which achieved 
the best effects on a range of outcome measures, and which differed 
most from each other. We might, as Fred Mosteller has suggested 
(personal communication, September, 1973), construe the PVHS study as 
a preliminary, hypothesis-generating field venture, not high enough in 
its standards of scientific rigor to be called an experiment but leading 
to various initial ideas that should now receive more careful study in 
a controlled field experiment with true randomization of subjects to 
treatments. PVHS results should not be thrown away; the next evaluation 
should build on what has been learned. PVHS data should be used to 
generate a much more limited and careful design, asking more fine- 
grained questions. This approach no doubt would appeal to OMB and 
other sectors concerned about cost-effective use of evaluation results. 

Second, and equally important, it seems only fair to assess Head 
Start in terms of what a good program can accomplish, on the grounds 
that once this baseline is established, dissemination can follow. In 
few areas of the federal government are programs justified on the basis 
of performance estimates taken from samples of performance under aver- 
age or randomly sampled conditions. It should be sufficient to demon- 
strate that within certain budget constraints it is possible for some 
programs to have good results in field testing. These programs are 
apt to be those that have received most care in their design and 
formulation and are most ready to be implemented. 

Third, it can be argued that a differential effects design enables 
pooling and therefore also enables us at a second level to answer the 
question of aggregate effects. This is in contrast to a general effects 
design, where if there are only slight aggregate gains we are never 
certain there were not strong selective gains in certain programs. 
This research wisdom can be combined with a parallel bit of political 
wisdom: If an evaluation has slight generalized effects as its primary 
finding, chances are it will have a negative influence on attitudes 
about the program. If it has large selective effects, policymakers 
may regard the entire program positively. It is important to consider 
political effects and make prior estimates about where findings are 
apt to be sizable. 
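
The point that a slight aggregate gain can conceal a strong selective gain is easy to show with a small numerical sketch. The figures below are invented purely for illustration and are NOT Head Start or PVHS results; the program labels and function names are likewise hypothetical. Each number stands for the gain of one site over its controls, in standard deviation units.

```python
from statistics import mean

# Hypothetical site-level gains over controls, in SD units.
# Invented for illustration only; not data from any evaluation.
gains_by_program = {
    "program_A": [0.9, 0.7, 0.8, 0.6],
    "program_B": [0.1, -0.1, 0.0, 0.0],
    "program_C": [-0.1, 0.1, 0.0, 0.0],
    "program_D": [0.0, 0.1, -0.1, 0.0],
}


def program_means(gains):
    """Mean gain for each program type (the differential-effects view)."""
    return {name: mean(values) for name, values in gains.items()}


def pooled_mean(gains):
    """Mean gain over all sites pooled together (the general-effects view)."""
    return mean(g for values in gains.values() for g in values)


for name, value in program_means(gains_by_program).items():
    print(f"{name}: {value:+.2f} SD")
print(f"pooled: {pooled_mean(gains_by_program):+.2f} SD")
```

In this made-up case one program shows a gain of about three-quarters of a standard deviation while the pooled figure is under one-fifth, so a design that reports only the pooled figure would hide the one program that appears to work.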

Finally, two other practical questions can be asked in a differential 
effects study that cannot in a general effects study, both of them of 
interest to the policymaker: Which programs have fairly robust effects 
across different child groups, and in each program, which clusters of 
Head Start objectives can be attained jointly? The latter question has 
never been explored. Every Head Start program has a list of goals in 
the domains of cognitive development, health and nutrition, parent parti- 
cipation, and so forth. It would be interesting to attempt a factor 
analysis of sorts, trying to figure out which goals tended to be attained 
independent of each other, which went hand in hand, and which were 
mutually exclusive. 

What might a new differential effects study look like? Whatever 
the design it would need to be small, careful, and unpretentious. Two 
approaches suggest themselves, one of which would be interesting if OCD 
wanted to stress cognitive effects and their relation to program cost, 
the other if cognitive effects were a secondary consideration in the 
evaluation. The first study can be described as an analogue to a crop 
fertilization experiment, in deference to R. A. Fisher, with the hope 
that equivalence between Head Start treatment and fertilizer treatment 
will not be misunderstood. The study would ask the same question the 
agricultural agent asks when he plants a field with a single strain of 
wheat and then fertilizes each third of it differently. The first third 
receives no fertilizer at all, the second receives an average dosage 
(low-cost), and the third receives an intensive dosage (higher-cost). 
Does the intensive dosage merit the additional money, and in general 
does dosage seem to matter? By analogy the Head Start research design 
would have equal numbers of sponsored programs, randomly selected tradi- 
tional Head Start programs, and non-Head Start controls. Questions 
would be those of cost and value added. 
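
The "cost and value added" questions in this three-part design amount to comparing each arm with the next cheaper one. The sketch below shows one way that comparison could be laid out. All costs and gains are hypothetical placeholders chosen for round arithmetic, not estimates from this report or from PVHS, and the class and function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    cost_per_child: float  # dollars per child, hypothetical
    mean_gain: float       # mean outcome gain in SD units, hypothetical


def value_added(arms):
    """For arms ordered from lowest to highest cost, compare each arm with
    the next cheaper one: extra cost, extra gain, and extra dollars spent
    per additional standard deviation of gain."""
    steps = []
    for lower, higher in zip(arms, arms[1:]):
        extra_cost = higher.cost_per_child - lower.cost_per_child
        extra_gain = higher.mean_gain - lower.mean_gain
        # If the costlier arm gains nothing extra, the cost per SD is unbounded.
        cost_per_sd = extra_cost / extra_gain if extra_gain > 0 else float("inf")
        steps.append((lower.name, higher.name, extra_cost, extra_gain, cost_per_sd))
    return steps


# Placeholder values for the three "dosage" levels; not real figures.
arms = [
    Arm("non-Head Start controls", 0.0, 0.00),
    Arm("traditional Head Start", 1000.0, 0.25),
    Arm("sponsored program", 1500.0, 0.50),
]

for lower, higher, cost, gain, per_sd in value_added(arms):
    print(f"{lower} -> {higher}: +${cost:.0f}, +{gain:.2f} SD, ${per_sd:.0f} per SD")
```

With these placeholder numbers the step from traditional to sponsored programs buys the same additional gain for half the additional cost, which is the kind of comparison the design is meant to support.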

Such a design has a number of merits. First, it speaks directly 
to one question policymakers want to ask. They do not really want to 
know whether Head Start is succeeding, because if they look at the 
data they know that a rather good prima facie case can be made (as good 
as in most other national evaluations) that some Head Start programs 
are succeeding and others are not. Instead, they want to know what it 
takes to put a good program in the field and what is the magnitude of 
predictable difference between a well-executed but more expensive 
program and an average, lower-cost program. These questions are natural 
ones for the economist or cost-benefit analyst. A three-part design 
comparing a sponsored program with traditional Head Start and with non-Head 
Start controls would enable us to begin to answer them. 

But there is also a problem with this idea. We must choose a sin- 
gle sponsor, or a very few sponsors, to represent the "high-cost" treat- 
ment. This means first of all that difficult judgments must be made 
about what is going to be called a coherent treatment— a single set of 
programs whose phenotypic variation in the field is not so great that 
they are no longer identifiably based on the same parent program. This 
could mean relying on a limited set of sponsors without all program types 
represented. One approach would be to explore in more detail the effects 
of the one or two programs that looked most promising, or had the most 
pronounced effects, in the PVHS study. The new study might attempt 
to learn more about success-related aspects of these programs that could 
be generalized or exported to other programs; it might also compare the 
programs with less expensive programs to gather baseline data on cost 
and quality. In such a study there would be no need to further dimen- 
sionalize centers, since the level of analysis would be the sponsor 
and not the type of program. OCD might choose as high-cost variations 
one structured program (e.g., Weikart Hi/Scope) that showed cognitive 
gains in PVHS, and one good program emphasizing social-emotional 
development (e.g., Bankstreet) with effects that probably were not 
given a fair chance by the PVHS battery. This kind of study would be 
a "mini-planned variation study," but with a new emphasis on the relation 
of costs and effects. 

The other kind of experiment that comes to mind assumes much more 
limited interest in cognitive effects, using them perhaps as one of a 
number of variables in a design of managerial program variations. The 
OCD has recently initiated the Improvement and Innovation program, 
according to which all Head Start programs in the field must choose one 
of the following configurations: 

1. Standard Head Start, center-based, five days per week. 

2. Variations in center attendance for individual children, 
varying hours of the day and days of the week. 

3. Home-based model. 

4. Double sessions; two classes per day in a center. 

5. Various locally designed options. 

In addition there are a number of experimental programs or demonstra- 
tions (Home Start, Parent-Child Centers, Child and Family Resource 
Centers, programs for the handicapped, and a proposed demonstration in 
the area of "developmental continuity") that will explore the arti- 
culation of preschool with elementary school programs. 

Assessment of these program variations according to managerial 
criteria might be the most sensible evaluation strategy, with cognitive 
effects a secondary consideration. Children's attainment of minimal 
sufficient cognitive benefits might be compared, for instance, in centers 
with regular attendance, centers with variable attendance, and home- 
based programs. 

o Position 3. It is sufficient to demonstrate strong 
cognitive benefits for some Head Start children. 

Another kind of evaluation strategy would try to determine which child 
subgroups were benefitting most from Head Start experience, either in 
randomly selected programs or in certain sponsored programs. There is, 
for instance, a line of evidence in the preschool research literature 
(Bissell, 1970; Karnes, 1973; Weikart, 1967, 1972) suggesting that the 
principal benefits of preschool experience may be for children with 
Stanford-Binet IQs of 80 to 90, a full standard deviation below average. 
Many studies suggesting highest effects for this group are thrown into 
question because of inadequate procedures for controlling regression 
to the mean from pre-testing to post-testing, but in at least one anal- 
ysis involving the PVHS data (Smith, personal communication, September, 
1973) it looks as though such effects may be real even with proper 
statistical adjustments. In addition, preschool advocates like Weikart 
point out that many children enter their programs "at risk"; without 
preschool experience they would be likely to end up assigned to classes 
for the mentally retarded (MR) in elementary school. After preschool 
these children may show a lower rate of MR class assignment. It would 
be impressive indeed to demonstrate that Head Start children of low IQs
were less apt to require costly attention in elementary school than 
children of low IQs without Head Start. There may also be other sub- 
groups for whom Head Start offers special benefits, such as physically 
handicapped children or children below a certain level in Standard 
English fluency. 
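
The regression problem raised above can be made concrete with a small simulation sketch. It assumes, purely for illustration, a test with mean 100, standard deviation 16, and reliability 0.8, and it builds in no program effect whatever; the function names and parameter values are hypothetical and are not drawn from the PVHS data. Children selected because their pre-test score fell between 80 and 90 still show an apparent average gain at post-test.

```python
import random
import statistics


def simulate_scores(n=20000, mean=100.0, sd=16.0, reliability=0.8, seed=0):
    """Simulate pre/post scores with NO treatment effect (illustrative values)."""
    rng = random.Random(seed)
    true_sd = sd * reliability ** 0.5           # spread of stable "true" scores
    error_sd = sd * (1.0 - reliability) ** 0.5  # spread of occasion-specific error
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(mean, true_sd)
        pre = true_score + rng.gauss(0.0, error_sd)
        post = true_score + rng.gauss(0.0, error_sd)
        pairs.append((pre, post))
    return pairs


def mean_gain(pairs, low=None, high=None):
    """Average post-minus-pre gain for children whose PRE score is in [low, high)."""
    gains = [
        post - pre
        for pre, post in pairs
        if (low is None or pre >= low) and (high is None or pre < high)
    ]
    return statistics.mean(gains)


pairs = simulate_scores()
overall_gain = mean_gain(pairs)                     # near zero: no real effect
low_band_gain = mean_gain(pairs, low=80, high=90)   # positive: regression artifact

print(f"overall mean gain:       {overall_gain:6.2f}")
print(f"gain for pre-test 80-90: {low_band_gain:6.2f}")
```

Under these assumptions, any claim of special benefits for the 80-to-90 group would have to show gains beyond what such a no-effect baseline already produces.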

One obvious question comes to mind: Why not combine a program 
effects and child-group effects study and do an evaluation principally 
intended to explore the interactions of program type and child subgroup? 
Helen Featherstone (1972) explored interactions in the PVHS data, and 
her work leads to a number of tempting hypotheses for further investi- 
gation. The answer to this question, I believe, is that even though 
interactions will be important to explore in any evaluation, it is pro- 
bably not advisable to attempt an evaluation focusing in the first 
instance on them. This would require a sample of Head Start children 
differing in its subgroup proportions from the actual Head Start popu- 
lation, and it would necessitate a fully crossed design, which might 
prove impossible or greatly at variance with naturally occurring
combinations of programs and child subgroups.

Another important variant of Position 3 has been espoused princi- 
pally by B. White (et al., 1972; 1973) and others concerned about sensi- 
tive periods and optimal times to intervene in the child's early 
development. The central question for this group of researchers, and 
the one they maintain should be central for policymakers as well, is
when to involve the child in a preschool program. Assuming a fixed
amount of federal money available for preschool programs and elementary
school programs, it may be the case, for instance, that the most important
period to reach the child is not during the Head Start years at all but
from 12 to 18 months. Burton White believes that by the time a child 
is four, when most children enter Head Start, it is too late to have 
much effect on cognitive and language development. It is also too late, 
he feels, for cost effective identification and treatment of basic 
deficiencies in sight and hearing, and other screenable developmental 
problems. Others maintain that early infancy is the most important
time to intervene, and still others (e.g., Bereiter, 1972) have come
to feel there is nothing done for a child's cognitive development in a
preschool program that could not be done as efficiently or better in the 
first year of elementary school. Many issues surrounding the relative 
costs and benefits of federal intervention at various age levels have
yet to be resolved.

For present purposes we might limit the question somewhat, asking 
whether the current Head Start program is as effective with four year 
olds as with five year olds, or more effective. Within the two year 
age span of the Head Start child population, which children are benefit- 
ting most? There is ample evidence from the PVHS study that four year 
olds show fairly high gains in the more successful programs but do not 
continue to gain so dramatically in their second Head Start year if 
they remain in the program. This kind of information might be weighed, 
along with information about effects for five year olds entering for 
the first time. 

It is appealing to argue that the government is committed to 
spending X dollars on educational programs for children and that the
real question is the cost-benefit one of when that money should be
spent. But this argument has one glaring problem: To compare the
effects of programs at different age levels we have to be able to
compare assessments across years, which, as all psychologists appreciate,
is extremely difficult. Imagine the nightmare of trying to compare
average gains on the Shaefer or the Bayley in an infant program with
average Stanford-Binet gains for the same group or a comparable group
at age five. This recalls a point underscored years ago by Kagan and
Moss (1962): there is often little comparability of phenotypic behavior
patterns from one age level to the next on a given dimension of personal- 
ity, cognitive ability, or achievement. We might add that at different 
age levels there is apt to be even less comparability of program-related 
changes in behavior. Competence in a particular domain is reflected 
differently at different ages, and a theory of mental development and 
mental process is needed to link earlier behavior patterns to later 
ones through some enduring dimension of mental ability or some latent 
trait. 
Theories of development can differ greatly in their implications
for policy decisions. As an example, if a Piagetian or Montessorian
theoretical framework is adopted, then motor acts at one age level are
believed to instruct verbal and perceptual ones at a higher level. 
This means that we would seek some equivalence between ability with an 
embedded figures task at age four, for instance, and perceptual form 
discrimination at a later age. Most of us would agree that such
equivalences or links probably do exist, but it would be impossible without
a more sophisticated knowledge of mental process to devise a test
battery for four year olds that taps the same latent dimension as for
seven year olds. Even on the Stanford-Binet, which has various forms
for various age levels, there are serious problems. First, the test 
is largely unreliable for children younger than five or six, and its 
predictive validity is notably lower for this group than for older 
children. Second, it is generally acknowledged to measure different 
factors of cognitive performance at different age levels, and consistency 
of score is more the result of a heuristic process of item refinement 
over the years than an indication that the same dimension of mental 
ability is being tapped across test levels. 

Some statements probably can be made about the best times for
economically identifying and curing such gross neurological
impairments as poor eyesight and hearing. This is the area of intervention
in which an age-related evaluation looks the best. For instance, it
would be interesting to know the intersection of two curves plotted on an
x axis of age and a y axis of cost, one presumably descending and the other
ascending, the former being for cost of diagnosis and the latter for
cost of cure. But primary concern in the Head Start evaluation has to
be with effects of educational programs, not early detection of physical 
disabilities, and in this domain age comparisons of treatment effects 
are apt to be too difficult to pursue. 
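
The crossing point of the two cost curves described above can be located numerically once the curves are estimated. The sketch below is a minimal illustration: the straight-line cost functions and their coefficients are invented for the example (no such estimates appear in the evaluations discussed here), and bisection is simply one convenient way to find where a falling and a rising curve meet.

```python
def crossing_age(diagnosis_cost, cure_cost, lo, hi, tol=1e-6):
    """Bisection for the age at which a falling and a rising cost curve meet."""
    def diff(age):
        return diagnosis_cost(age) - cure_cost(age)

    if diff(lo) * diff(hi) > 0:
        raise ValueError("curves do not cross between lo and hi")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(lo) * diff(mid) <= 0:
            hi = mid   # sign change (or exact zero) lies in the lower half
        else:
            lo = mid   # sign change lies in the upper half
    return (lo + hi) / 2.0


# Hypothetical straight-line curves, in arbitrary cost units per child.
def diagnosis_cost(age_years):
    return 120.0 - 20.0 * age_years   # assumed to fall with age


def cure_cost(age_years):
    return 30.0 + 10.0 * age_years    # assumed to rise with age


age_at_crossing = crossing_age(diagnosis_cost, cure_cost, lo=0.0, hi=6.0)
print(f"curves cross at about age {age_at_crossing:.2f}")
```

With real data the two functions would be replaced by fitted cost curves; the search itself does not depend on their being linear, only on their crossing once within the age range examined.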

o Position 4. Cognitive effects should be a moderator 
variable. It is necessary to demonstrate cognitive 
effects for some Head Start programs and children, but 
such effects are not among the most important evaluation 
outcomes. 
This position assumes that in the forthcoming evaluation there should be
a more modest role for cognitive effects measures. The coming evaluation
is of social competency, broadly defined, and consumers of the research
are not going to expect a return to major emphasis on cognitive
instruments. Instead, cognitive outcomes will serve as moderator variables
of sorts, necessary but not sufficient to demonstrate that programs
are accomplishing something. They will be allowed to fade into the
background to the extent that other measures in other domains can be
presented with high enough reliability and validity to command respect.
By this logic, for instance, the evaluation might administer one or two
well-established cognitive instruments to the entire sample or to a
randomly selected or stratified subsample. This would replicate what
had been done before, indicating whether programs were doing as well
as previous ones according to traditional criteria, but it would leave 
major emphasis and the burden of proof of positive effects on other 
kinds of instruments. This is a "maximum" strategy: Guarantee that 
a trustworthy baseline of moderator data is provided by a modest 
cognitive battery, and then offer a set of more high-risk, high-gain 
measures as the principal evidence of program effects. 

The wisdom of this strategy cannot be assessed without knowing
how much we are going to be able to trust new instruments in
non-cognitive areas, and whether OCD will have sufficient time to develop
new instruments. These issues need to be clarified.

o Position 5. We do not have the measurement technology at 
present to assess Head Start's cognitive effects. 

A fifth position is mentioned briefly here because it is a kind of 
null-hypothesis, representing the views of certain skeptics in the 
educational research community and the community of psychologists. 
This position holds that it is folly to mount any new Head Start 
evaluation at all right now. Instead, we should return to basic 
research in program dimensionalization, test design, and quasi- 
experimental methods. We still do not know how programs differ 
from one another; we do not know how to measure their effects very 
accurately in any domain and can't measure them at all in some; and we
have not yet been able to mount a full-scale natural experiment without
many confounded variables and violations of the rudimentary canons of
good research. Perhaps the evaluations conducted so far have done
nothing but squander the taxpayers' money, resulting in mistaken or
confused inference about program effects, not helping policymakers at
all in deciding whether to continue the program, terminate it, or work
to strengthen certain parts.

These issues deserve a paper by some ardent psychometrician, policy 
researcher, or planner. But for practical purposes the position is 
not very appealing; it seems extreme to maintain that Head Start cannot 
be evaluated because of a lack of adequate evaluation technology. The
predictable reply is that we must do the best we can with a difficult 
assignment, forging a new evaluation mindful of the vicissitudes of 
field-based educational research. The point of view that "more basic
research is needed" is fine for scholarly journals, but it does not
help decisionmakers unless we are honestly prepared to say that we can 
learn nothing from an evaluation. Most of us would stop short of
saying that.

WHICH POSITION SHOULD BE ADOPTED?

Choices about a sufficient demonstration of cognitive effects and 
the weight placed on the cognitive effects battery in the overall design 
of the evaluation deserve careful consideration. Which position shall 
be espoused? Much of the subsequent discussion of cognitive effects 
measurement is contingent on the choice.

Current orthodoxy and the need for formal evaluation criteria 
have led many policymakers to espouse what I have called Position 1;
it is safe to assume that this position remains the dominant precon- 
ception: (1) Head Start must demonstrate generalized effects and (2) 
Head Start is on much firmer ground if it can demonstrate effects that 
maintain themselves over time. It would be impossible to overlook 
these two success criteria completely in any forthcoming evaluation. 
Despite this, it also would clearly be a mistake to design an evaluation 
with either of these questions as the primary one addressed. To do so
would almost surely guarantee that Head Start did not have a fair chance
in the evaluation. We know that overall effects are apt to be minimal
even if statistically significant, regardless of measures employed. We
also know from the heterogeneity of Head Start programs that any really
interesting or systematic effects probably will not be universal.
Finally, longitudinal effects are apt to be slight and expensive to
trace even in the best Head Start programs.

The concern of policymakers with these two criteria should be
acknowledged, putting the criteria forward candidly in the new evaluation
proposal. But generalized effects and longitudinal effects should be
shown to be of secondary interest. Primary interest should be elsewhere,
perhaps in examining closely and more systematically than before the
effects of limited subsets of programs or the effects of programs on
limited subsets of participant children. The evaluation could adopt
as its principal strategy some variant of either Position 2 or Position
3. This is in keeping with the proposition that PVHS was a preliminary
exercise in hypothesis generation, and now we are ready for one or more
carefully executed, smaller scale social experiments: real experiments,
perhaps in the sense that children are randomized to program types, or
[...] research design are not overlooked and basic care in test
administration is not forgotten.
The design of such a study, whatever it explores, needs to be small and
well-controlled, without a host of sponsors, enormous sample size before
inferences can be made, and confounded independent variables and
uncrossed levels of the design. Keeping the study small and elegant will
enhance its credibility immeasurably.

If OCD adopts instead some version of Position 4, then in the
cognitive domain we need only resurrect some of the more reliable
individually administered measures from past Head Start evaluations, make
a judgment about how many children should be tested, and try to do the
testing more carefully than before. Position 4 implies an emphasis
on new non-cognitive measures and measures outside the domain of
cognitive performance.
III. THE ROLE OF COGNITIVE EFFECTS MEASURES
IN THE HEAD START BATTERY



In past Head Start evaluations, heavy emphasis on individually
administered pre-tests and post-tests of cognitive performance has
left many observers with the impression that the tail was wagging the
dog. Evaluators, looking for any measures for young children with
enough validity and reliability to be respectable in traditional
psychometric terms, have returned time and again to the Revised
Stanford-Binet, subtests of the Illinois Test of Psycholinguistic Abilities
(ITPA), and the Peabody Picture Vocabulary Test (PPVT) as measures
of cognitive ability, along with certain achievement tests more
directly assessing short-term Head Start learning (e.g., the Wide
Range Achievement Test (WRAT), the Preschool Inventory (PSI), and
the Deutsch NYU Booklets 4A and 3D).1 Criticisms have been made of
the weight such tests have received in all major Head Start
evaluations; many feel that they do not fairly tap what Head Start programs
are trying to accomplish, even within the cognitive domain. Some
programs are more concerned with motivation and cognitive process than
with skill performance. Others are presumably slighted because
their curriculum does not "teach to the tests" or teach to the specific
domains of competence tapped in the tests.

The choice to weigh cognitive effects heavily has been made largely
by default, not because researchers thought cognitive instruments were
more important or had higher validity in any absolute or theoretical
sense, but because other measures, including those in the areas of
affective growth, motivational or attitudinal change, and classroom
behavior, are usually so poor and of such low reliability that they
cannot be taken seriously. This fact has not changed much in the past
eight years despite the heroic efforts of various test developers to
devise measures of non-cognitive effects and observational schemes
enabling a departure from paper and pencil or individually administered,
clinical testing. For an excellent review of non-cognitive measures,
readers are referred to Walker's 1973 book on the subject.

1 Tests mentioned in this and the subsequent section will not be
referenced separately as long as they appear in Walker, Bane, and
Bryk's PVHS test summary (1973). Copies of the tests and manuals in
the form used in the PVHS evaluation can be obtained from the ERIC
Clearinghouse for Tests, Measurement and Evaluation, Educational
Testing Service, Princeton, N.J. 08540.

Two individually administered tests of cognitive effects have
dominated the major evaluations of Head Start: the Revised
Stanford-Binet (Form L-M) and the Preschool Inventory (PSI), developed by
Betty Caldwell in 1965 especially for Head Start as a
criterion-referenced measure of school readiness. The PSI was reduced from 64 to
32 items in 1969 to facilitate Head Start testing by paraprofessionals 
in abbreviated testing sessions. A third instrument included in the 
PVHS evaluation was the Deutsch NYU Test Booklets, two of which, the 
4A and 3D, are straightforward achievement measures with fairly high 
reliability in Head Start measurement situations. 

It is valuable to look at these tests to gain a notion of the 
"state of the art" in measuring Head Start's cognitive effects. Here 
they will be considered exemplary and typical. A fuller summary of 
cognitive measures can be found in excellent ETS and Huron Institute 
volumes (Educational Testing Service, 1968; Walker, Bane and Bryk, 
1973). 

Three general propositions should be considered. First, the
evaluation of cognitive effects may not be the most important goal of a
new Head Start evaluation. This issue relates to Position 4 in the
previous section. It is deeply ingrained in the tradition of Head
Start that cognitive gains are a good basis for policy decisions, but
few of those who care most about Head Start care principally that a
child gain five points on an IQ test during the Head Start year. Most
are far more concerned that children have a socially exciting experience,
that they get involved with other children of comparable age, that
they prepare themselves emotionally for school, that they be proud of
themselves and their own capabilities, that they be aware of various
facts and entities in their physical and social surroundings, and so
forth. Indeed, for the most part those who have found cognitive gains
a predominant criterion of program success have been researchers
worried about reliability and policymakers worried that the program is
not going to lift children out of poverty by giving them a boost toward 
average or better school achievement and the later benefits they assume 
will flow from this. Arguably the evidence of the past several years 
points to how silly the researchers and policymakers have been in this 
emphasis and how right the parents, providers, and other taxpayers have
been to ignore or protest it. Cognitive gains may simply not be that
important. The coming evaluation may need to assert this and tell
policymakers why the evaluation needs to be recast. 

Just as in basic psychology, where a theory is replaced only when
a better one comes along, there is a need to offer measures in other
domains (notably health and nutrition, social competence, and influence
on the family) that match the cognitive measures in credibility while
supplanting them in importance. This credibility need not involve
levels of psychometric validity as high as would be demanded in a
laboratory setting. Face validity is sufficient. But it does require
that measures be reliable and that measurement be feasible in
field-testing situations. In addition, if new instruments are to be used,
the team finally responsible for analyzing the data (those who will
write the final report) must be involved with the evaluation early
enough to agree to the idea of weighing these measures heavily. In
other words, it should not be allowed to happen, as has been the case
in the past, that those conceiving the evaluation are an entirely
different group from those analyzing the data, with a different
conception of which instruments to stress.

In selecting instruments, it is also important to realize that there 
is a difference between psychometric validity and political validity. 
This difference helps explain, for instance, why the Stanford-Binet 
has repeatedly been chosen as a Head Start outcome measure. No 
psychologist who knows the Binet and also knows what Head Start is 
trying to accomplish feels that using this test to evaluate the program 
makes much sense. The instrument was designed to measure a unitary, 
stable trait of general intelligence, not to measure program-related 
achievement or increases in performance in specific realms of cognition. 
Items are not samples from larger pools representing theoretically 
coherent dimensions of mental ability, there is no subscale structure,
and the predictive validity of gains on the test, as against measurement
in the one-shot testing situation, is unknown.

A measure of Head Start's cognitive effects ideally would be quite 
different, telling us (1) which dimensions of cognitive performance 
Head Start was able to influence, (2) whether gains on these dimensions 
had face-valid importance or predicted to better than expected outcomes 
in later schooling and later life, and (3) whether these "leverageable" 
dimensions, on which Head Start could have some effect, could be linked 
causally to specific curricular components of the Head Start
program. This means, as a fanciful example, that it would be nice if
we knew gain in the area of digit-span memory was one of the effects 
that Head Start often had; that such gain resulted in some benefits 
during kindergarten or for the Head Start child's immediate life before 
kindergarten (for example, it generalized, enabling the child to ex- 
pand short-term memory in a number of other realms by a new chunking 
strategy); that a gain in this area predicted to greater competency for 
the Head Start child in later schooling; and finally, that one area 
of the Head Start curriculum, in this case a specific set of structured 
drills, taught this particular skill. Knowing that much, we would 
indeed be on the track toward a theory of instruction. We would have 
some idea of how to appraise cognitive gains. As a less ambitious 
goal, even if we knew nothing about generalizability of acquired skills 
or about how they predicted to desiderata in later schooling and later 
life, and even if we did not know exactly which aspects of the Head 
Start program caused shifts in cognitive performance, it would be 
enough to show that some face-valid gains in areas of obvious practical 
interest could be effected and then to characterize these areas. 

The Binet does not enable us to talk about any of these things. It 
certainly is not a good measure of Head Start effects in the ambitious 
sense outlined first. We know nothing from it about dimensions of 
cognitive gain, since items on each test level are not representative 
of various domains of cognitive performance. Nor does the test tell 
us anything about the predictive validity of gains. In fact, since 
the instrument was designed to measure a stable latent trait, any gain
arguably must be interpreted as a reflection of low test reliability.

The Binet also is not good in the more circumscribed sense of a 
criterion-referenced achievement measure. Its items are not intended
to tap skills that Head Start teachers feel are the most important 
ones to teach, and they often have little apparent connection with 
actual kindergarten-related skills or first grade skills. Indeed, 
they are chosen to measure something that is not teachable — a perman- 
ent characteristic of general intelligence— rather than skills that 
can be readily acquired. In addition, of course, the Binet is cultur- 
ally biased— it was designed for a middle-class white population and 
normed on this population. To use it on Head Start children — and to 
be oblivious of its differential validity for different groups by 
geographic region, ethnicity, and so on— is to ignore test aspects 
that Terman and the other designers of the test would never have over- 
looked themselves. 

Why, then, have preschool evaluations persisted in using the Binet?
The answer, I think, is that the test has political validity; that is, 
it has a certain credibility among researchers and policymakers simply 
because it is known (by name, if not by psychometric pedigree) and 
has been used traditionally in assessing the intelligence of young 
children. An IQ "gain" has a mystique about it— it suggests that 
one fixed level of intelligence, or "g," has been replaced by another. 
As we know, this interpretation is largely spurious. But it is the 
public notion, and it is firmly enough entrenched that many researchers 
and others turn to the Binet almost reflexively rather than fight the 
more difficult battle of trying to explain to an audience of
non-psychologists why the test is inappropriate. In addition, of course,
the Binet is administered by trained testers, many of whom exist 
around the country already, and although it is more expensive to 
administer than most other tests in past Head Start batteries (the 
PVHS evaluation could afford to give it to only half the sample), its 
level of reliability in very short-interval test-retest situations 
and its inter-rater reliability have probably been higher than for 
other tests (Walker, Bane and Bryk, 1973). This too has tended to
give it credibility. 
Political validity is important and should not be ignored
altogether. But in a future evaluation it is probably important to
educate policymakers and others about the inappropriateness of certain
time-honored instruments when these instruments are applied in the
context of Head Start evaluation, rather than cater to predispositions
about the tests that "really matter."

PROBLEMS WITH THE PAST MEASUREMENT OF HEAD START'S
COGNITIVE EFFECTS

It is useful to summarize certain recurrent shortcomings of past
cognitive effects batteries. None of the problems mentioned here are
easy to remedy; perhaps some of them are inevitable, given the
limitations of current instrument development. But all of them are likely
to recur if no special efforts are made to avoid them. 

1. Low quality of the field operation for test administration 
and other aspects of data collection. This problem is first on the 
list. It cannot again be overlooked without serious consequences. 
In past Head Start evaluations, even the most rudimentary aspects 
of data collection have gone wrong. We have tried to collect too 
much data for too little money, and the results have been appalling. 
Those who have worked with the data have never been sure which test 
results could be trusted, even among those that should be most reliable 
and reliably administered. Examples of oversights abound; it may be 
useful to mention a few: 

o Test-retest and inter-rater reliabilities for administration
in the Head Start setting generally have not
been reported, and in two instances where such infor- 
mation has been gathered, the ETS Longitudinal Study 
and the Huron Institute reliability study conducted as 
part of 1971-72 data collection (see Walker, Bane and 
Bryk, 1973), results have been unacceptably poor on 
many cognitive measures. 

o In the PVHS study, some tests were administered by 
trained and specialized testers (e.g., the Stanford- 
Binet, the 8-block Sorting Task), but most were 
administered by community paraprofessionals who re- 
ceived only a short briefing in how to give them 
(e.g., the PSI). Many of the testers were not 
uniform in techniques for establishing rapport with
children, presenting test materials, or scoring 
children's responses. 

o On the PSI and most other cognitive measures, identical 
forms were administered at pre-test and post-test, with 
no alternate forms or item sampling procedures. Practice 
effects seem likely, and it was fairly easy to teach to 
the test. 

o In some PVHS sites, there was a single tester pre and 
post for certain children and different testers pre 
and post for other children. For instance, in one site 
on the Binet, control children had the same tester pre 
and post and experimental children did not. It was 
also noted that experimental children had unusually 
low pre-test scores in comparison with controls. 
Since the children were from the same preschool popu- 
lation, this suggests that perhaps there were selec- 
tion effects or unreliable pretestings. How should 
such data be interpreted? 

o Until quite recently there has been no scoring of 
individually administered tests for response style. 
Now at least the Hertzig-Birch scoring scheme has be- 
come part of the PVHS and ETS studies. But most 
testers in the field, especially those administering 
tests other than the Binet, have never been trained 
to code response style according to the Hertzig- 
Birch scheme or any other, and without sufficient 
training for testers this kind of coding is apt to 
be of low inter-rater reliability. Much of the data 
collected so far, while suggestive, may not be worth 
analyzing because it is of poor quality. 
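
The test-retest and inter-rater reliabilities referred to in the list above are conventionally reported as correlations between two sets of scores for the same children. The sketch below shows that computation under simple assumptions: the Pearson coefficient is used as the reliability index, and the eight pairs of scores are made up for illustration rather than taken from any Head Start data set.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary on both occasions")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for eight children tested twice, a short interval apart.
first_testing = [12, 15, 9, 20, 17, 11, 14, 18]
second_testing = [14, 13, 10, 19, 18, 9, 16, 17]

retest_r = pearson_r(first_testing, second_testing)
print(f"test-retest reliability: {retest_r:.2f}")
```

The same function serves for inter-rater reliability if the two lists hold two testers' scorings of the same children. Reporting such coefficients for each instrument, as administered in the field, would address the first oversight listed above.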

Problems like these are apt to occur for a very simple reason. At 
the outset, those conducting the evaluation have the best of intentions; 
a limited test battery is selected with some efforts made to ensure 
reliable test administration. Then for political or other reasons, as 
the evaluation progresses a new measure simply must be included in the 
battery, or a new site simply must be added, or a deadline for initial 
test administration simply cannot be met with enough time for training 
of testers. The integrity of the field operation is gradually under- 
mined by decisions subsequent to the original plan for data collection. 

One does not have to be an organizational theorist or have any 
experience with large-scale data collection efforts to know this much:
If an evaluation does not collect good data no one will believe
whatever it concludes. This means that we should look carefully at the
Sesame Street effort and other field assessment efforts of high quality, 
trying to emulate them. Probably we should estimate initial sample size 
on the basis of higher than anticipated cost estimates for test admin- 
istration, and then multiply that estimate by two or so to get a fair 
approximation of real cost! 

2. Poor theoretical rationale for the individually administered 
cognitive tests in the Head Start battery. Any adequate treatment of 
this problem could fill a good-sized book. I will only try to spell
out some unresolved issues. In general, psychologists put all of 
these issues under the rubric of validity questions. In the present 
context we are not concerned, as we were above, with tactical questions 
about sufficient magnitude and duration of effects. Instead we are 
concerned with "truth" questions about what we as researchers have 
actually demonstrated when we find an effect, usually in the form of 
a transition from time 1 to time 2 in children's performance on an 
individually administered test. I will list some persisting confusions 
in the measurement of Head Start effects that can be traced, I think, 
to confusions surrounding the theoretical rationale for the instruments 
themselves as they are applied in the context of Head Start. 

The first problem is what might be called the assumption of initial 
incompetence. This is the notion that Head Start children begin at 
a level of cognitive functioning that is somehow inadequate, "deprived," 
or ignorant and progress to some level of competency. Robert Hess 
(1969) has created an interesting taxonomy, arguing that implicit in 
most people's thinking about compensatory programs is some notion of 
a mental state pre and post: some concept of what deprivation and 
non-deprivation look like. Most such conceptions are based on an 
operational definition of competence, publicly defined, often related 
to performance expectations in the schools. Hess points out that 
these models may all be wrong or bigoted, a point also made 
convincingly by Cole and Bruner (1972), Labov (1972), and others. 
These researchers suggest that we often err because we do not begin 
as anthropologists, assuming a position of cultural relativism. The 
child's mental state pre may be just as sophisticated as his mental 
state post; the only change brought about by Head Start may be to 
introduce him to a set of role expectations, norms, patterns of 
acceptable verbal conduct, and so forth that prove adaptive for 
him in getting from his own cultural context to that of the school and 
the so-called dominant culture. His versatility is increased, not 
his capacity. 

Most of us are familiar with this point and I will not belabor 
it, except to ask that we consider its implications for the cognitive 
effects battery. It suggests that perhaps we need a new and pluralist 
conception of the appropriate end-point for Head Start activities; for 
different cultural groups, different sets of goals may be appropriate. 
In the past evaluations we have avoided this issue because it has 
seemed to lead rapidly to a test battery comprising culturally unique 
and non-comparable measures. This remains a danger, but in the past 
we have gone too far in the other direction. The selection of tests 
and the development of special Head Start tests have made the assump- 
tion of a uniform initial competence. None of the tests has been able 
to tap culturally relative patterns of mental performance at either 
pre-test or post-test. The Binet is notorious for cultural bias, the 
PSI was developed explicitly to be culturally biased on the theory 
that this was the fairest way to assess level of preparation for 
middle-class school situations, and other measures also show no 
particular ability to tap skills that a child may bring to Head Start. 
There is much to say for choosing measures, or developing new ones, 
that ask how skills the child brings to Head Start become transmuted 
over the year into skills he can use at school or in cultural contexts 
outside his own. 

If we want to base tests on the cultural relativist model of 
cognitive development and program effects, one way to proceed would 
simply be to look at interactions of child group, test, and program 
type, as Featherstone (1972) and others have done. This is certainly a 
step in the right direction, but it does not attack the problem at 
the level of the tests themselves, exploring whether magnitude of 
shifts in score on a single test can ever be a fair yardstick of 
program success for various cultural groups. 

Another confusion surrounding theoretical rationale is lack of 
clarity about the relative importance of cognitive-developmental as 
against behavioral Head Start goals. A particularly interesting 
exchange on the question of appropriate cognitive goals for preschool 
programs took place in Interchange in 1970. It was between Lawrence 
Kohlberg and Carl Bereiter, with Kohlberg arguing the position of the 
stage-sequential Piagetian and Bereiter the position of the behaviorist. 
The discussion has direct bearing on the question of theoretical 
rationale in choice of tests and test construction. Bereiter tried to 
make the case that there was no point in measuring anything but face- 
valid changes in skill levels and other readily perceptible dimensions 
of cognitive performance that would be adaptive in school, because we 
simply did not have an adequate theory of intellective functioning or 
intellectual development to allow us to see other kinds of changes, 
in this case the attainment of concrete operations or the extension 
of concrete operations into some new domain, as an important 
achievement. Bereiter maintained that on theoretical grounds the Kohlberg 
point of view was suspect because the child presumably would attain 
concrete operations anyway sooner or later and there was no point in 
hastening the process, even assuming it could be hastened. He also 
maintained on empirical grounds that we had no valid or reliable 
measures to tell us when a child has successfully extended his capacity 
for concrete operational thinking into new realms. 

Kohlberg responded that from the standpoint of cognitive develop- 
ment, the kinds of "gains" Bereiter was left with as a residue were 
trivial after he eliminated all that he believed could not be discussed 
because of inadequate theoretical rationale. To teach a child numbers 
and letters, for instance, is not an important enough task to merit 
the efforts of a federal program (especially, he might have added in 
hindsight, since Sesame Street seems to be doing it so much more 
cheaply). This is a specific skill, like learning to swim, and the 
fact that children learn it says nothing about enhanced or 
generalizable cognitive functioning in other domains, or in the future. For 
Kohlberg, face-valid and directly school related skill acquisition is 
not a sufficient goal for a preschool program. 

Without doing justice to all aspects of the Bereiter-Kohlberg debate 
I only want to suggest that reasonable men differ about whether we 
should have cognitive-developmental or strictly behavioral goals for 
Head Start, and that these differences depend on their own theories of 
development or their judgment that some theories are worthy of in- 
fluencing choice of goals while others are not. The measures included 
in the Head Start battery reflect such theoretical or atheoretical pre- 
dispositions, whether implicitly or explicitly. The point is funda- 
mental: Every Head Start measurement strategy is based on a theory 
of cognitive growth. Thus the educational policy researcher finds his 
own measurement strategy no stronger or weaker than the basic develop- 
mental theory on which it is founded. 

In the past there have been two main implicit biases reflected in 
the measures selected. The first has been theoretical but curiously 
inappropriate and unlike Kohlberg's: researchers have chosen tests 
like the Binet designed to measure a stable trait of general 
intelligence. The second has been wholly atheoretical: researchers have 
chosen criterion-referenced measures of skills directly involved in 
kindergarten and first grade competence. Neither approach has been 
satisfactory, the former because it does not tap any growth function 
of the sort that Kohlberg would emphasize and the latter because 
readiness tests have been too sparse, too culturally biased, and too 
little able to demonstrate concurrent validity in correlating with 
other areas of competence even in the short term. 

There are two directions we should consider in advancing to a 
clearer theoretical rationale for the tests in the Head Start cognitive 
effects battery. The first is toward theoretically oriented tests, 
which focus on patterns of cognitive growth instead of cognitive stasis. 
Some of the Piagetian clinical assessment techniques and Kohlberg 
techniques for assessing stage-sequential development and horizontal 
decalage may be worth exploring (see, for instance, Green, Ford, and 
Flamer (1971), and Marcus Lieberman's (1970) thesis on a maximum 
likelihood estimation of stage-assignment for children according to 
performance on various Piagetian tasks). It would also be valuable 
to develop or select more thoughtful criterion-referenced achievement 
measures. 

A third problem surrounding the theoretical rationale for test 
selection is persisting confusion about whether we should use criterion- 
referenced or norm-referenced tests. This point has two aspects, 
the first related to the question of what we are trying to measure, 
and the second concerning when it is appropriate as a matter of testing 
theory to use each kind of measure. Norm-referenced tests are designed 
to show where an individual child's performance stands in relation 
to the distribution of performances for all individuals in some 
appropriate reference group. Scores are reported, therefore, as 
they relate to the mean performance of all children at a given age 
or grade level or in terms of a percentile rank. Such tests are developed 
by choosing items from a larger pool of face-valid items according 
to intermediate item difficulty, high item-scale correlation, and 
theoretical coherence in the dimension they measure. Items that are 
too easy or too hard are excluded because they do not contribute to the 
variance that can be explained by the test. Criterion-referenced 
tests, in contrast, try to compare an individual's performance to 
some set standard (hence "criterion") rather than to the performance 
of a reference group. The basic idea is to reach agreement on what 
constitutes acceptable performance in some area and then to select 
items from an item pool that either are highly correlated with some 
other direct measure of such performance or somehow themselves repre- 
sent an agreed test of such performance. 
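
The two selection rules just described can be made concrete with a small sketch. It assumes pass/fail item scores coded 1/0. The difficulty band (0.3 to 0.7), the minimum item-rest correlation (0.3), the cut score, the toy data, and all function names are illustrative assumptions, not values taken from any Head Start instrument.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # an item everyone passes or fails carries no variance
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """Return (difficulty, item-rest correlation) for each item.

    responses: one list of 0/1 item scores per child.
    """
    stats = []
    for j in range(len(responses[0])):
        item = [child[j] for child in responses]
        # Score on the rest of the scale, excluding the item itself.
        rest = [sum(child) - child[j] for child in responses]
        difficulty = sum(item) / len(item)  # proportion of children passing
        stats.append((difficulty, pearson(item, rest)))
    return stats


def select_norm_referenced(responses, lo=0.3, hi=0.7, min_r=0.3):
    """Keep items of intermediate difficulty that correlate with the scale."""
    return [j for j, (p, r) in enumerate(item_statistics(responses))
            if lo <= p <= hi and r >= min_r]


def meets_criterion(child_scores, cut_score):
    """Criterion-referenced reading: did the child reach the set standard?"""
    return sum(child_scores) >= cut_score


# Six hypothetical children, four items. Item 0 is passed by everyone and
# item 3 by no one, so neither contributes variance.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]
kept = select_norm_referenced(responses)
```

The norm-referenced rule discards the items that every child passes or fails, while the criterion-referenced reading ignores the distribution altogether and asks only whether a given child reached the standard.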

In the case of color recognition, as an example, it would no doubt 
turn out that if a child could identify the colors of four crayons 
chosen at random from a box, this would correlate highly with his 
ability to identify colors other than the ones actually selected, and 
in various objects other than crayons. In some cases, passing the 
test itself might be sufficient demonstration of attaining the 
criterion. We might ask the child to interact with peers in a 
classroom, for instance, which in itself is identical to the competency 
expected of the child later. This literal achievement of a criterion 
is what Kohlberg (1970) means when he refers to the "industrial 
psychology" approach in testing. By analogy, if an adult has to 
operate a particular machine, it goes without saying that it is a 
sufficient test of his ability to sit him down with it and watch him 
perform. In either case, the one where criteria correlate highly with 
competencies or the one where they are the competencies to be 
demonstrated, criterion-referenced items are selected according to which 
ones and how many of them have to be passed before it can be reliably 
predicted that the individual will be able to meet the criterion, or 
perform acceptably. The test is designed with a threshold in mind, 
above which adequate performance can be expected. 

In principle, of course, there can be a rank ordering of criterion 
performance of children from worst to best, and the criterion-referenced 
test can be converted into a normed one. But the idea of an absolute 
confidence level rather than a relative one is quite different in the 
first instance, especially in its implications for item selection. 
Norm-referenced items are chosen first on the grounds of intermediate 
difficulty and scalability, along with face validity. Criterion- 
referenced items are chosen with external validity as the prime 
consideration. According to a norm-referenced item selection strategy, 
for instance, a particular embedded figures task might have been in- 
cluded on the Stanford-Binet because it is of average difficulty for 
a particular mental age group and it is highly correlated with other 
items on its six-item age scale. It might also correlate well with 
later composite IQ score and load on the same factor as a secondary 
consideration. But there would be no theoretical reason why it need 
correlate highly, for example, with increases in understanding 
teacher requests in kindergarten or first grade, with specific 
knowledge of useful school-related facts, or with other practical aspects 
of cognitive attainment. 

If we want to find out about short-term achievement, perhaps 
controlling for IQ, then we choose items initially on rather different 
grounds. We select them first for external validity: what measures 
directly predict best in the short term or to the competencies 
preschools teach? Items selected with this practical purpose in mind 
can then be scaled, but what is desired is a test with high item 
difficulty at pretest and variance of item difficulty at post-test, 
such that there is maximum homogeneity of scores at pre-test and 
maximum variance at post-test. This guarantees that our 
"criterion score" is sensitive and that fine gradations can be made 
among children regarding whether or not they attain it. 

The PSI is close to being a criterion-referenced test even though 
it has not been interpreted or standardized as such in the Head Start 
evaluation. But it is culturally biased, it is intended only as a 
measure of readiness for middle-class kindergarten, and it does not 
even have any great face validity as a measure of school readiness. 
The items are too few and too arbitrary. Curiously, its value has 
supposedly been legitimated by demonstrating its "concurrent validity": 
how highly it correlates with the Binet and other tests in the Head 
Start cognitive effects battery. This is precisely how it should not 
be validated if it is tapping different things from those tapped in 
a general intelligence measure. We should, I think, follow the lead 
of the Sesame Street test developers and carefully consider a foray 
into unabashed criterion-referenced testing (Ball and Bogatz, 1970; 
ETS, 1974). 

A fourth problem is that there are persisting technical difficulties 
in the analysis of change scores. During the analysis of the PVHS 
study Marshall Smith and his co-workers filled two fat notebooks with 
articles on change scores and how they should be treated, many of 
them contradicting or rebutting each other. The evaluation group never 
did feel confident enough of the issues to select a single technique, 
instead looking for consistencies among outcomes using a number of 
different techniques. This made sense from the heuristic standpoint, 
but it is hardly reassuring for those who would like precise estimates 
of effects. We need a better statistical technology for analyzing 
gains with appropriate covariate adjustments, or else we need tests 
explicitly designed to measure changes in cognitive skill attainment. 
Problems of analysis are clearly related, of course, to the earlier- 
mentioned problems of inadequate developmental theory. 
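
How two defensible techniques can disagree is easy to show with a toy case. The sketch below compares a raw gain-score contrast with a residualized-gain contrast, in which post-test is regressed on pre-test in the pooled sample and the groups are compared on the residuals. Both function names and all scores are invented for illustration, and neither technique is presented as the one the evaluation group actually used.

```python
def mean(values):
    return sum(values) / len(values)


def slope_intercept(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def raw_gain_difference(pre_t, post_t, pre_c, post_c):
    """Mean (post - pre) in the program group minus the same in the comparison group."""
    gain_t = mean([post - pre for pre, post in zip(pre_t, post_t)])
    gain_c = mean([post - pre for pre, post in zip(pre_c, post_c)])
    return gain_t - gain_c


def residualized_gain_difference(pre_t, post_t, pre_c, post_c):
    """Difference in mean residuals after regressing post on pre in the pooled sample."""
    slope, intercept = slope_intercept(pre_t + pre_c, post_t + post_c)
    res_t = mean([post - (intercept + slope * pre) for pre, post in zip(pre_t, post_t)])
    res_c = mean([post - (intercept + slope * pre) for pre, post in zip(pre_c, post_c)])
    return res_t - res_c


# Hypothetical scores: the program group starts lower than the comparison group.
pre_program, post_program = [8, 10, 12], [13, 14, 15]
pre_comparison, post_comparison = [12, 14, 16], [15, 16, 17]

raw = raw_gain_difference(pre_program, post_program, pre_comparison, post_comparison)
adjusted = residualized_gain_difference(pre_program, post_program,
                                        pre_comparison, post_comparison)
```

With these invented scores the raw gain contrast favors the program group by two points, while the residualized contrast is zero, because every child falls exactly on the pooled regression line. When groups differ at pre-test, the choice of technique can decide whether an "effect" appears at all.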

3. Poor theoretical rationale for the observation and interview 
measures in the Head Start battery, with the additional difficulty 
[...]. If the picture is gloomy among individually administered tests 
of cognitive effects, it is even worse among classroom observation and 
interview techniques. 

Among the classroom observation measures, there has been a tendency 
for investigators to spread their nets wide, attempting to capture all 
that is going on in the classroom rather than exploring particular and 
limited aspects of classroom process. In so doing they usually have 
captured very little. Classroom observation techniques need to be more 
closely orchestrated with individually administered tests to explore 
the performance of cognitive competencies whose capacity is assessed 
in the individual testing situation. It is confusing and difficult 
to dredge up hypotheses from data generated by the current 
measures, and analysts of the PVHS data have tended to ignore them. 
Observers also have often not met acceptable levels of 
inter-judge reliability in field testing. 

Teacher and parent interviews, while more reliable, are of lower 
validity as measures of Head Start success, since neither parents nor 
Head Start teachers are unbiased observers. Blind situations have 

not often been engineered with independent observers in the home or the 
center [...] competency ratings or [...] interview [...] 

[...] it has led to a greater emphasis on the [...] We need to consider 
whether this [...] individually administered [...] the only ones the 
investigators felt could be trusted enough to interpret. 

4. [...] Even if there are trade-offs of reliability or added expense, 
it is still not clear that we should opt for individual measures to 
the exclusion of others. 



WHAT IS TO BE DONE? 

As a practical matter, the next Head Start evaluation cannot 
radically revamp all current measures and techniques for assessing child 
performance. Many issues of test development are long-range ones, for 
which solutions are likely to emerge only after years of careful 
basic research and incremental test development. In particular, it 
is unlikely that Rand or anyone else, given a six to twelve month 
period, will be able to devise entirely new item banks for individually 
administered tests and entirely new observation schemes. Instead it 
is far more likely that the new cognitive effects battery will involve 
imaginative scavenging from parts of tests already available and 
resourceful application of various tests now being developed. It is 
important to decide what work to cut out in the creation of a new test 
battery. If we honestly feel that no measures now exist that are 
appropriate for Head Start evaluation, then perhaps it is best to deal 
with that now rather than later, rejecting the notion that a new 
evaluation can start within a year. If, however, there are certain measures 
currently available, others that are appropriate in part, and others 
that can be developed without too much effort, then we can proceed 
within the anticipated time frame. As a third possibility, of course, 
some may feel that the current PVHS battery is perfectly adequate and 
that the problems I have cited are occupational hazards that any federal 
evaluation must undergo, without major policy consequences. 

I would like to state my biases regarding these options. I think two 
strategies deserve consideration in planning the forthcoming evaluation, 
one I will call validity based and the other reliability based. A 
validity-based strategy follows from the point of view that the 
principal problem with past evaluations has been a validity problem: We 
were not measuring what Head Start was actually doing. Those who would 
support this strategy feel that any new cognitive effects battery for 
Head Start must represent a significant departure from past batteries. 
If it does not have new conceptual foundations, however well it might 
be administered, it will show us little that we do not already know. 
This strategy anticipates considerable time for the development of new 
instruments, perhaps as much as two years before the end of field 
testing and the beginning of the national evaluation. 

A reliability-based strategy assumes a different point of view: In 
the past we have failed not so much because instruments were invalid 
but because they were not administered carefully enough. The task at 
present, therefore, is less one of devising new measures and more one 
of designing and administering the new evaluation so that the integrity 
of the child performance data can be assured. We can begin a new 
evaluation soon if we can accurately estimate the cost of reliable 
test administration and design a study that permits reliability within 
known budget constraints. 

Each of the strategies leads to a different conception of appropriate 
next steps in planning the evaluation. The validity-based strategy 
suggests we should begin by recasting our conceptions of cognitive 
effects, letting contracts for the development of new measures, and 
arranging field tests for instruments as they are devised. The 
reliability-based strategy suggests we should start by identifying the 
best currently available measures, estimating the cost of administering 
them in the field, and considering various designs for an evaluation 
in which they would be used. 

Predispositions about the best sequence of steps in designing the 
study itself also depend in part on whether planners are validity- 
oriented or reliability-oriented. A validity-oriented group is apt 
to recommend the following steps: (1) Isolate areas of potential 
program effects; (2) devise instruments that assess change among Head 
Start children in these areas; (3) estimate a sample size and design 
large enough to enable valid inference about cognitive effects on these 
instruments for the total population and important subgroups; and (4) 
compute the cost of the study, cutting back the design in certain areas 
if cost is too high. One implication of this sequence of steps is that 
the validity-oriented planner will end up with longer phases of planning 
and preparation and will tend to think about overall budget only after 
he has decided what needs to be measured, what instruments must be 
developed to measure it, and how large a sample it will take to assess 
cognitive effects for various groups of Head Start children. 

With a reliability-orientation, the sequence of steps in planning 
the evaluation would be quite different: (1) Estimate the total budget 
available for the evaluation and select a minimal set of the best cur- 
rently available instruments; (2) compute the per-child cost of admin- 
istering the battery with sufficiently high reliability in the field; 
(3) divide total budget available by per-child cost of test 
administration to arrive at sample size; (4) with sample size as a constraint, 
figure out what design is both feasible and sufficient to answer 
important policy questions. Unlike the validity-based approach, this 
one is conservative, beginning with present testing technology and 
budget expectations. These are seen as prior constraints in deciding 
sample size and selecting questions the evaluation can afford to ask. 
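
The arithmetic of the reliability-based sequence can be sketched in a few lines. The function name, the dollar figures, and the fixed-cost parameter are hypothetical. The default inflation factor of two follows the earlier suggestion that a cost estimate for test administration be multiplied "by two or so" to approximate real cost.

```python
def reliability_based_sample_size(total_budget, per_child_cost,
                                  cost_inflation=2.0, fixed_costs=0.0):
    """Number of children that can be tested reliably within a known budget.

    total_budget:   money available for the evaluation
    per_child_cost: estimated cost of reliably administering the battery
                    to one child
    cost_inflation: multiplier applied to the per-child estimate to allow
                    for underestimated field costs (assumed default: 2.0)
    fixed_costs:    hypothetical costs that do not vary with sample size,
                    such as tester training
    """
    if per_child_cost <= 0 or cost_inflation <= 0:
        raise ValueError("per-child cost and inflation factor must be positive")
    available = total_budget - fixed_costs
    if available <= 0:
        return 0
    # Whole children only: round down.
    return int(available // (per_child_cost * cost_inflation))


# Hypothetical figures: a $600,000 budget and a $75 per-child estimate.
n_children = reliability_based_sample_size(600_000, 75.0)
```

Sample size falls out of the budget rather than the other way around, which is what makes this sequence the conservative one: the design questions in step (4) are asked only after the affordable number of children is known.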

The validity-based and reliability-based strategies as I have 
sketched them are archetypes, primarily useful as schematic ways of 
thinking about planning choices. No doubt the actual planning of the 
evaluation will reflect both strategies. But it is also likely that 
in the planning process one of the two modes will predominate, 
exerting marginally more influence than the other. It would be useful 
for the OCD to decide in advance which of the two makes more sense as 
a primary strategy, given current administrative and political realities. 

If money and time are to be spent on developing new instruments, 
this is a major commitment. Probably a certain amount of new 
instrument development is important, but I believe it would be wise for the 
evaluation design team to devote most of its energies to thinking about 
how to execute the evaluation well, with a first-rate field operation. 
I would be happy if the new evaluation administered only four to six 
cognitive effects measures of various kinds, plus a few measures of 
related outcomes or processes to enable concurrent validity estimates. 
In general, reliability of test administration and high face-validity 
are less ambiguous and more realistic as goals than high predictive 
validity; or, as one policy-analyst phrased it, "observational power" 
is more salient than "predictive power." 






LIKELY REALMS OF COGNITIVE EFFECTS 

This section offers a list of cognitive effect domains, discussing 
each and considering whether it is an area where Head Start effects are 
likely to be found. The five domains in the present typology were 
originally proposed by Sheldon White (personal communication, September, 
1973). 

1. Norm-based kindergarten or first grade readiness. The cognitive 
dimensions of first grade readiness have received much attention in 
the past, not only in Head Start evaluations but in other school-re- 
lated testing. The Metropolitan Readiness Test, like the PSI, is a 
well-known standardized achievement measure designed to assess prepara- 
tion for school. There also are such tests as the Meeting Street 
Inventory (Hainsworth and Siqueland, 1969), intended to screen "high 
risk" children who, because of some minor physical or behavioral 
disability, may require special attention when they enter school. 
Some of these tests predict reasonably well to kindergarten and first 
grade achievement scores, although partial correlation with achievement 
in kindergarten and first grade is likely to be less impressive when 
IQ is introduced as a control. Another problem is that these tests 
are designed to be administered only once; their characteristics in 
the pre-test to post-test gain situation, especially where alternate 
forms are not available, are not so clear. On a test like the PSI, 
with only one form, practice effects seem inevitable. 
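
The point about introducing IQ as a control can be illustrated with the standard first-order partial correlation formula. The three correlations used below are invented for illustration and are not estimates from any readiness test.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant.

    Here x is a readiness score, y is later achievement, and z is IQ.
    """
    denominator = sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("x or y is perfectly correlated with the control")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical values: readiness correlates 0.6 with first grade
# achievement, 0.5 with IQ, and IQ correlates 0.7 with achievement.
controlled = partial_correlation(r_xy=0.6, r_xz=0.5, r_yz=0.7)
```

With these assumed values, the readiness-achievement correlation drops from 0.6 to roughly 0.4 once IQ is partialled out, which is the kind of shrinkage the text anticipates.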

In general, these tests have the one major advantage of face 
validity. Cognitive performance items and behavioral objectives that 
relate to first grade readiness are easier to gain consensus about 
than items and objectives in other domains. This is especially true 
if test developers do not anticipate a diversity of kindergarten or 
first grade situations, instead contenting themselves with measuring 
what is required for competence or adaptability in an average, 
white middle-class kindergarten (see, for instance, Caldwell, 1967). 
We have seen that this is a questionable assumption. 

If there is any concern with diversity of child populations, geo- 
graphic regions, and kinds of kindergartens or first grades, the task 
of developing face-valid measures in this domain becomes more difficult 
but it is by no means impossible. It makes sense to be concerned with 
such diversity, looking for tests and test items with high face validity 
for the actual child groups considered. This may mean developing 
various equivalent measures of first grade readiness for different 
ethnic or geographic subpopulations. 

One of the most interesting areas for school-related measurement 
is a time-honored one: reading readiness and readiness in numeric 
skills. These need to be tested in a number of ways, not only with 
letter and number recognition tasks but also with techniques borrowed 
from the psychology laboratory. To measure decoding skills we might, 
for instance, ask children to distinguish between letters of the 
alphabet and the experimental stimuli of Gibson et al. (1962). It would 
also be valuable to consult with the Children's Television Workshop 
team assessing effects of the Electric Company. In addition, observa- 
tion of increased interest in reading in the Head Start classroom or 
at home might be an important face-valid indicator. 

In the area of numeric skills, it seems wise to consult with the 
group working at MIT and the Educational Development Center in Boston 
on a new television program to teach math skills, analogous to the 
Electric Company in the area of reading skills. This group is devoting 
its efforts to discovering teachable components of numeric reasoning 
in young children. In addition, Piagetian measures should be carefully 
explored (Green, Ford, and Flamer, 1971). Piaget is especially con- 
vincing in talking about the shift from pre-operational thinking to 
concrete operational thinking, and the implications of this shift for 
the child's notion of reversibility, class inclusion, and other aspects 
of logic and inference. Many feel this is the kind of "math" Head 
Start should be teaching. 

Another approach to readiness assessment involves having elementary 
school teachers observe Head Start children in actual kindergarten 
or first grade classrooms. This procedure already exists in many school 
districts, where each child spends a trial half-day at elementary 
school the spring before entering the school. Perhaps this technique 
could be used with appropriate blinds to see if teachers could dis- 
tinguish between the readiness of Head Start children and that of others, 
using a teacher checklist or rating scale. 

In general, school readiness remains a good area in which to 
evaluate Head Start's cognitive effects. Face-valid measures are easier 
to develop than in other areas, and short-term effects should have 
an unambiguous meaning that policymakers and other nonresearchers can 
appreciate. 

2. Theory-based developmental shifts. If school readiness goals 
are fairly clear, theory-based developmental goals are ambiguous and 
difficult to rationalize. They are tempting to explore, because we 
know that from five to seven the child undergoes a dramatic transition 
in many dimensions of cognitive process, emerging a qualitatively 
different thinker at the end of this period than he was at the begin- 
ning. But theories of development do not tell us much about how and 
where to teach. Even if we could adjudicate among them, differentiating, 
for instance, between the claims of the Piagetians and the claims of 
learning theorists, we still would not fully understand their implica- 
tions for pedagogy. This point was made by John Dewey years ago 
(1900), and it has cropped up again in the debate about the fallacy 
of trying to "accelerate" Piagetian stage-sequential development. 
Even if we have a clearly articulated, norm-based theory of 
development, we know little from it about those specific teaching interventions 
that will enhance development or predict to a fuller development. It 
is almost as though a norm-based theory of development is one kind of 
predictive entity and an intervention-based theory of short- and long- 
term effects is another. The latter does not follow automatically from 
the former. 

The Bereiter-Kohlberg Interchange debate (1970) again is in- 
structive, where each theorist feels the other's goals for preschool 
are trivial— not deserving of a major federal program. Kohlberg 
believes specific skill acquisition is easy to effect, but it is 
reversible and not of enduring importance. In any event, it could be 
done without all the educational trappings of a preschool program. 
Bereiter feels that stage-sequential development is an elusive notion. 
We do not know whether we can influence it with an educational program. 
Even if we can, it is not clear we should bother to do so, since the 
child sooner or later attains concrete operations regardless of early 
intervention. Moreover, stage-sequential goals cannot ever be satis- 
factorily translated into behavioral objectives. 

Each of the theorists also makes concessions to the other, however. 
Bereiter admits, and has increasingly been on record as saying, that 
any specific skill worth teaching in the preschool could as easily or 
more easily be taught in the first grade. In this sense he believes 
preschool is not cost-effective. Kohlberg agrees that the goal of 
trying to accelerate stage onset is not worthy; he feels the real 
effort should be in trying to avoid inexcusably late stage onset in 
some children, and to bring about wider horizontal decalage of the 
child's present stage. 

An evaluation using theory-based developmental criteria could 
adopt one of three strategies. The first is to rely heavily on Piagetian 
measures and make Kohlberg's preschool goals pre-eminent. There 
are certain areas of development where this strategy would be wise. 
The transition from egocentric to sociocentric activity on the part 
of the child, for instance, is clearly important in school, home, and 
neighborhood situations. Such a shift would have obvious face validity. 
Piagetian measures also would be useful in the area of numerical skill 
development, another link between theory-based and school-readiness 
criteria of program success. 

The second approach is to sample competence in a number of domains 
of basic cognition as indicators of developing thought processes in 
the child. This procedure might enable us to explore certain five to 
seven growth dimensions that, although largely maturational in their 
etiology, establish a backdrop for achievement gains in the school- 
readiness domain. If such processes are monitored, it is important 
that some of the instruments measuring them be non-verbal. It is also 
important that such measures use stimuli familiar to children of 
different cultural groups. Sampled competence domains might include: 

o Short-term and long-term memory, with special attention 
to memory span and transformations while retaining 
information in short-term memory. The child might work on 
a problem while being required to keep two other things 
in mind. 

o Perceptual detection, tapped by embedded figure tasks, 
reorganization of familiar objects, upside-down 
transformations. 

o Using a code. The child might be asked to learn a six 
digit glyph code and then apply it in some familiar 
situation. 

o Conceptual and perceptual equivalence tasks. Here there 
are lots of examples on current tests, some of them bad 
because they involve unfamiliar objects or are confounded 
with verbal response requirements. One good beginning is 
the ETS enumeration task; the second half of this test 
combines a recognition task (similarities and differences) 
with a Piagetian perceptual inference task involving mental 
transformation of a picture. 

o Simple problem solving and other inferential tasks. 

In general, it is probably a mistake to sample dimensions of basic 
cognition except as a means of acquiring limited baseline data about 
maturation-related changes. These dimensions are important, but Head 
Start cannot reasonably be expected to have much effect on them. If 
Head Start children experience gains on basic cognition items, any 
improvement beyond the purely maturational is apt to stem from better 
rapport with the tester, motivation in the testing situation, or other 
incidental factors. 

A third theory-based approach might be based on some theory of 
sensitive periods. The Montessori approach, for instance, espouses 
the theory that earlier motor training and training in perceptual 
discrimination is a necessary prerequisite to later competencies, per- 
haps not in the sense that it represents an irreversible critical 
period, but at least in the sense that a motor substrate can be laid 
down more easily at an earlier age than a later one and has to be 
present before some subsequent capacity can develop. If the hand 
instructs the eye, according to this line of reasoning, then let us 
instruct the hand at the right moment. 

This point of view undoubtedly has some truth to it. The idea of 
prior motor schemata emerging to become conceptual schemata later is 
somehow right, although far too impressionistic even in the most fully 
articulated theories to do more than satisfy our yearning for aesthetically 
pleasing constructs or whet our appetite for more concrete and 
testable ones. We certainly do not know at present how to link earlier 
instruction with later cognitive benefits: how to massage the black box 
in a certain way and a year later have it reward us with some desirable, 
newly established competency. According to some theories we are not 
even sure about how to verify the existence of the later capacity. 
Without a more convincing theory of sensitive periods, and one that 
can be easily operationalized, we are probably ill-advised to consider 
measures intended to assess prerequisite early learning unless that 
learning has face validity as well as theory-based significance. 

3. Changes in cognitive process. Head Start may have its most 
dramatic effects in the area of cognitive process. This realm is a 
promising one for exploration in the next evaluation. Investigation 
of cognitive process shifts also overlaps conveniently with consider- 
ation of social competency, which the OCD has recommended as a 
principal focus of the new evaluation. 

There are three facets of cognitive process; each merits attention. 
One aspect is quite narrow, having only to do with response style and 
response coding in the individual testing situation. Individually 
administered tests can be coded for response style as well as correct- 
ness. Little is learned by a coding of correct-incorrect as compared 
with some coding scheme that can register shifts in the child's 
approach to the task and means of solving it. Some aspects of cognitive 
style are closely related to the thinking involved in solving the 
problem itself, such as the search strategy the child uses in trying 
to recall something he was asked to retain in short-term memory. 
Others are related to impulsivity or reflectivity in solving the 
problem, the child's technique in probing the tester to elicit clues, 
and the child's global reaction to the testing situation. Although 
the non-task-related response style factors may not generalize beyond 
the testing situation, chances are that cognitive process factors 
related to problem solving itself will. These are the ones we should 
be attentive to in the individual assessment situation. 

A second aspect of cognitive process is the transfer and actual 
performance in a larger behavioral context of new skills or capabilities 
that have been demonstrated in the individual testing situation. 
We are interested that children know the alphabet, for instance, but 
we also are interested in how and when they use it in the classroom 
and the home. Observational schemes are required to study this facet 
of process. 

A third sense of cognitive process is linked to global notions of 
social-cognitive competency and cannot be reduced to skills generalized 
or transferred from the ones measured in the individual test situation. 
Changes may take place in learning to use adults as resources, learning 
such attentional techniques as dual-focus monitoring, learning to 
select attainable and satisfying activities and goals in the classroom 
and elsewhere, learning appropriate tempo of play (at what pace, how 
long, and with what duration of sustained involvement in particular 
aspects of the activity), learning to seek good problems. This third 
sense of cognitive process has been largely overlooked in past Head 
Start evaluations. Empirically and naturalistically defined, it is 
a high-risk, high-gain area for measurement in the next Head Start 
evaluation, with much to be measured but few current instruments to 
do the job. New measures might be based on instruments for ethological 
observation in the neighborhood and home (Barker, 1968; Schoggen and 
Schoggen, 1971; Watts et al., 1972; White and Watts, 1973; Wright, 1967). 

4. Social competency and awareness. This category has a sizable 
overlap with the process category. But here we are concerned with 
the child's instrumental knowledge of his or her immediate environment — 
knowledge about people, rules, etiquette, institutions (What does a 
policeman do?). It would be valuable to have a paper from Erving 
Goffman on "children's relations in public," trying to map some of the 
strictly kinesic dimensions of awareness about the social world of the 
neighborhood and the school, about older children and what to do and 
not to do around them, and about appropriate and inappropriate strategies 
for getting what you want as a child. This kind of knowledge could be 
tapped by various kinds of measures, but it does not lend itself to 
assessment in one-to-one testing situations. It is possible to ask 
the child a number of simple, direct, true-and-false questions about 
what to do and what not to do in his or her neighborhood, but attempts 
at this kind of individually administered item often are without much 
validity. They are detached from particular local circumstances and 
surroundings or only fragmentary in what they measure. In this domain 
we should probably place more trust in observational measures: samplings 
of child behavior in the classroom, the home, or the neighborhood. 

Sheldon White (personal communication, September, 1973) has called 
this type of cognitive competence "ability to use community-accepted 
metaphor." This is apt. Head Start children probably grow in their 
awareness of cultural norms, usually the norms of two or more cultures. 
If the program helps children learn to mediate the discontinuity between 
the culture of their homes and that of school and workplace, then 
we should be trying to assess this increased sophistication directly. 
Some of the measures Cole is developing (personal communication, July, 
1973) in connection with his school-based research in New York City 
deserve to be considered for adaptation to Head Start. 

5. General knowledge. The general knowledge category refers to 
public knowledge any child, regardless of "ecological niche," might 
be expected to know about the world. The category includes general 
information about history, government, current events, and other areas 
regarded as important but without any immediate practical significance 
to the child. In many tests, items tapping general knowledge have 
been included both as indicators of school readiness (a dubious purpose) 
and as general intelligence items. The predictive validity of such 
questions in one-shot testing usually proves as high when correlated 
with later school performance scores as any other intelligence or 
achievement item. But they have been controversial when used in Head 
Start evaluation because it is not clear that general knowledge is 
useful or necessary for a preschooler, or that because a child does 
not know some specific fact he will suffer later in school or in his 
day-to-day life. It is also hard in general to make the case that 
there is any single corpus of knowledge and facts that all children 
should know. 

Certain Head Start programs may feel that mastery of a particular 
realm of general knowledge is an important program goal. If knowledge-based 
testing is performed, it should be seen as largely a test of 
language comprehension or vocabulary. Language comprehension or 
vocabulary items need to follow the Berko-Brown (1960) format, whereby 
a picture is shown and then two alternative sentences are presented, 
one of which is a correct description of the picture and the other not. 
Such questions are not intended to test language production by the 
child or syntactic understanding, but purely semantic understanding. 
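
As a rough illustration of the two-alternative format just described, the following Python sketch represents one picture-and-sentence item and scores a set of responses. The class name, field names, and function names are hypothetical and do not come from any published instrument. Because each item offers two sentences, a child who guesses will average about half the items correct, so a proportion near 0.5 should not be read as comprehension.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PictureSentenceItem:
    """One two-alternative item: a picture plus two candidate sentences.

    All field names are illustrative assumptions, not part of any real test.
    """
    picture_id: str   # hypothetical label for the picture shown to the child
    sentence_a: str   # first sentence read to the child
    sentence_b: str   # second sentence read to the child
    correct: str      # "a" or "b": which sentence describes the picture


def score_item(item: PictureSentenceItem, response: str) -> int:
    """Return 1 if the child chose the sentence that matches the picture, else 0."""
    if response not in ("a", "b"):
        raise ValueError("response must be 'a' or 'b'")
    return int(response == item.correct)


def proportion_correct(items: Sequence[PictureSentenceItem],
                       responses: Sequence[str]) -> float:
    """Proportion of items answered correctly (about 0.5 is expected from guessing)."""
    if len(items) != len(responses):
        raise ValueError("need exactly one response per item")
    if not items:
        raise ValueError("need at least one item")
    total = sum(score_item(i, r) for i, r in zip(items, responses))
    return total / len(items)
```

The scoring records only which sentence was chosen, in keeping with the point that such items test semantic understanding and not language production.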

PRIORITIES FOR MEASUREMENT 

Among the five realms of cognitive effects, I feel priorities for 
evaluators lie in the areas of school readiness, cognitive process, 
and social awareness and competency. Before we understand the pedagogical 
implications of cognitive developmental theory, it seems unwise to 
orient an evaluation to theory-based changes. It also seems wrong to 
stress general knowledge, since this realm is so hard to stake out 
unambiguously and harder still to make a virtue of mastering. 

The three priority areas have a common advantage: They are amenable 
to assessment with atheoretical, empirically based, and criterion-referenced 
measures. For the most part issues of latent-trait shifts 
and predictive validity are finessed; the emphasis of the evaluation 
is with short-term observable changes in the child's performance and 
general behavior, as tapped by individual tests, observational schemes, 
and rating scales and interviews. These three sets of assessment 
criteria weight the evaluation primarily toward school readiness and 
the child's growth as an effector of his environment, as a manipulator 
of the immediate physical and social surroundings. This is a modest 
but practical orientation. 

It is tempting to recommend that Head Start discard all instruments 
used in past evaluations and develop an entirely new cognitive effects 
battery. Before taking such a position, however, we should seriously 
consider one variant of what I have earlier called Position 4. This 
is the hypothesis that there is no point in spending a lot of money 
to develop new measures because we have already shown what we will show 
again: A good program can get gains on any measures and a bad one 
probably cannot. There are already at least 20 good lab school studies, 
and now the Planned Variation study, indicating that some programs do 
achieve short-term effects on a variety of cognitive measures. If 
short-term effects need to be demonstrated again, perhaps we should 
not spend money on new instruments, instead simply pointing to the 
growing list of studies showing short-term gains and choosing a 
limited number of tried and true measures to show that these gains 
can be replicated by a good Head Start program. 

If we shuffle our measures and succeed only in demonstrating 
effects of the same order of magnitude and reflecting the same order 
of program differences shown in the PVHS study, the average policymaker 
[illegible] we have told him anything new. In fact, [illegible] 
it could make matters worse. In the search for new measures 
evaluators might well trade away reliability for a presumed increase 
in face validity, risking distrust of results. 

I believe we should invest in instrument development only if it 
will result in a limited but excellent battery of cognitive effects 
measures that will give us both more trustworthy evidence [illegible] 
in the field and substantial increases in validity, especially in 
talking about the differential effects of high-cost programs and 
low-cost programs. In other words, if we can have a right design, a 
limited number of program prototypes to explore, and a limited 
number of cost-related questions to ask, then I am satisfied it is 
worth investing a considerable portion of the evaluation budget to 
develop new instruments. If, however, there is no indication that 
[illegible] children showing higher effects for some programs and 
letting it [illegible]. 

Let us proceed on the assumption that we are interested in 
reconceptualizing the measurement of cognitive effects, adopting a 
validity-based strategy and devising a good new battery with a limited 
but promising repertoire of instruments. In each of the three areas 
recommended in Section III (school readiness, cognitive process, and 
social competency) we need to begin by listing specific behavioral 
objectives relevant to all programs. Ideally, although it is outside 
the scope of this study, we need [illegible] behavioral objectives 
[illegible] program to program according to [illegible]. 
Instead we should simply make sure that at least one instrument is 
included in the battery that each program can accept as a measure of its 
cognitive effects. 

The following list of behavioral objectives in each of the three 
high-priority measurement realms is far from complete: 

School Readiness: Does the child know his numbers and letters? 

Can he keep three bits of information in 
short-term memory and work on one of them? 

Can he detect the difference between the 
Gibson stimuli and letters? 

Can he sustain attention on some school- 
related task for five minutes? 

Can he comprehend sentences presented in the 
Berko-Brown format? 

Can he exhibit advanced pre-operational thinking 
on certain clinical, Piagetian measures of 
quantitative reasoning? 

Does he exhibit a sufficient level of sociocentric 
awareness when playing with his 
peers? Does he fight with them? 

Does he use relational terms in carrying out 
a series of commands? 

Can he express himself clearly enough in 
standard English to make various requests 
of his teacher and other adults? 

Cognitive Process: Does the child progress toward greater 
reflectivity in problem solution? 

Is his tempo of play well modulated? How 
long is each sustained involvement? How 
does this differ for various activities? 

Can the child use adults as resources? Does 
he have a number of different strategies 
for doing so in the classroom, at home, or 
in the neighborhood? 

Can the child monitor one activity in the 
classroom while doing another? 

Can the child select something he wants to do 
and see it through to completion, in the 
classroom or from day to day in the neighborhood 
(sustained, goal-directed activity)? 

Can the child invent alternate strategies 
for solving a problem in the test situation 
or attaining some goal in the classroom or 
the neighborhood? 

Can the child monitor, relate to, and mani- 
pulate the desires of his peers? 

Can the child apply to a new problem a strategy 
he has been taught in several previous 
structured, problem-solving situations? 

Can the child seek good questions and does 
he routinely do so? 

Is the child more verbal, in the simple sense 
of gross production of coherent sentences? 

Is spontaneous verbal elaboration of answers 
more pronounced in the individual testing 
situation? In the classroom? 

Is the child more observant of his older 
siblings as role models around the house 
and neighborhood? 

Social Competency: Does the child understand the functions of 
various community institutions and officials 
(in a culturally valid sense, not a 
textbook sense)? Does he know what his 
older brothers and sisters think of the 
police? The mayor? School? And why 
they feel this way? 

Does the child know his neighborhood: its 
geographical layout, various points of 
interest (e.g., the library, the community 
center) and various people in these places 
who can be of use to him? 

Does the child know his rights in the community? 
Where to go if something happens to him, 
whose business it is to protect him if 
something goes wrong (e.g., the doctor in 
the local hospital, the family counselor 
at the welfare agency, etc.)? 

Does the child know and understand the attitudes 
of his parents toward him and his 
siblings? And how these might differ 
from the attitudes of other parents? 

Does the child know where his father and 
mother work? Has he ever visited them 
there and does he know what they do? 

Has the child ever visited the school he 
will attend in the coming year and met 
teachers there? 

Does the child know certain things around 
the neighborhood it would be unwise for 
him to do, either because they might re- 
sult in physical harm or because they 
would violate neighborhood or cultural 
norms? 

Can the child switch easily from dialect of 
the neighborhood to standard English and 
back? Does he know the neighborhood cir- 
cumstances under which each is appropriate? 

Does the child talk to his parents, especially 
about matters not related to his own conduct? 

This list is just a beginning. It needs to be greatly amplified 
before we can winnow the list to "best bets" for actual Head Start 
measurement. The team designing the measurement battery needs first 
to come up with a complete list of candidate behavioral objectives 
and then to invite a group of Head Start teachers and directors to 
critique them, rejecting unlikely ones and adding some of their own. 
Certainly a list could be generated which is far more imaginative and 
more face-valid than the ones assumed in past evaluations. 

Now let us turn to the various types of measures which might be 
employed. 

INDIVIDUALLY ADMINISTERED TESTS 

These have been the work-horse measures in all previous Head Start 
evaluations. They have fairly high reliability and tend to be moderate 
in cost if they are not too long and tester training is not too elaborate. 
Average cost varies between the high cost per testing of the 
Binet, for instance, and the much lower cost of the PSI or N.Y.T. 
booklets. Individually administered tests are good for measuring some 
aspects of school readiness. They also are useful in measuring 
theory-based developmental gains and general knowledge, but I have 
tried to argue that these areas should not receive major attention. 
Finally, they may be useful in measuring selective aspects of cognitive 
process, especially where response style is able to tell us something 
about process; and in the measurement of social competency, where 
Berko-Brown types of questions and child interviews are useful. 

In general, we have erred in the past by looking at too narrow a 
slice of the Head Start child's experiential transformation over the 
year. To the extent the choice of individually administered tests is 
responsible for this myopia, they should not be emphasized. Of course, 
if there is little time for developing new measures, a new evaluation 
may still have to rely heavily on these instruments. 

Individually administered tests of cognitive performance are 
currently available in a wide variety, although in many cases their 
ready application to Head Start can be questioned. Readers are referred 
to the ETS summary of available tests (1968), and the Huron 
Institute report on the PVHS battery (Walker, Bane, and Bryk, 1973). 
In general there is a great need for better empirically based, external 
validity based, and criterion-referenced measures. Some new tests in 
this category deserve attention. In particular the ETS CIRCUS (1974), 
developed by Bogatz and other Sesame Street test developers, might be 
adopted. The ETS CIRCUS attempts to extend principles of Sesame 
Street test development into more general preschool and early elementary 
school testing. It is a promising criterion-referenced battery and 
might lend itself wholly or in part to Head Start measurement. CIRCUS 
has the additional advantage (or hazard?) of including various tests 
administered to several children at once by a single tester. If reli- 
ability can be maintained, the cost advantages of such a scheme are 
obvious. 

Piagetian clinical measures also should be explored, and scoring 
for response style should be mandatory on all individually administered 
instruments. 

New tests also might be based on what we have learned from past 
Head Start testing. One approach would be to look for items or subscales 
from instruments used in earlier Head Start evaluations that 
explained a high proportion of generalized gains or between-program 
variance, and then to create new scales from these for the new evaluation. 
This process of scavenging would require independent assessment of the 
validity and reliability of the new composite measures, but the new 
instruments might be more valuable than the older ones. Alternatively, 
we might find out from past evaluations which subscales and items were 
most reliably administered, using only these in a reduced scale of one 
or two factors. The assumption in this case would be that reliability 
and cost were primary considerations, within some validity constraint. 
Test dimensions might be reduced with a significant increase in reli- 
ability per dollar. 
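
The second approach, keeping only the subscales that buy the most reliability for the money within some validity constraint, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field names, the idea of a single reliability coefficient and a single validity index per subscale, and the simple ratio used for ranking. None of it is drawn from an actual Head Start data set.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Subscale:
    """Summary of one subscale from a past evaluation (hypothetical fields)."""
    name: str
    reliability: float     # e.g. a test-retest or inter-tester coefficient, 0 to 1
    validity: float        # a judged or estimated validity index, 0 to 1
    cost_per_child: float  # administration cost in dollars, must be positive


def rank_by_reliability_per_dollar(subscales: Sequence[Subscale],
                                   min_validity: float) -> List[Subscale]:
    """Drop subscales below the validity floor, then rank the remainder
    from highest to lowest reliability per dollar of administration cost."""
    for s in subscales:
        if s.cost_per_child <= 0:
            raise ValueError(f"cost_per_child must be positive for {s.name}")
    eligible = [s for s in subscales if s.validity >= min_validity]
    return sorted(eligible,
                  key=lambda s: s.reliability / s.cost_per_child,
                  reverse=True)
```

A ratio of this kind is only a screening device. It treats validity as a pass-or-fail constraint, as the text suggests, and says nothing about how the retained subscales should be combined into one or two factors.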

CLASSROOM OBSERVATION INSTRUMENTS 

Classroom observation schemes often have been of low validity in 
the Head Start classroom because they were designed for investigation 
of another kind of classroom setting or because they have tried to 
monitor too many aspects of classroom process at once. In general, 
the only kind of observation instruments that should interest us in 
a new evaluation are those enabling exploration of particular hypo- 
theses regarding face-valid behavior changes over the Head Start year. 
These may be hypotheses closely linked to performance on individually 
administered instruments or they may be hypotheses not amenable to 
exploration in any other way, such as those concerning dual-focus 
monitoring or sociocentric play. Best bets should be made in advance 
about most likely face-valid changes and should determine what is 
observed. It is too late at the time of data analysis to dredge for 
interesting results. Classroom observation measures tend to be more 
costly than other measures, both to administer and to analyze, and 
they are apt to be less reliable, especially if they require collection 
of large amounts of information in a short observation interval. This 
is another reason to be clear in advance about hypotheses to be explored. 

It probably will not be necessary to devise entirely new observation 
instruments. The ETS PROSE (Medley et al., 1971) has many interesting 
aspects, as do the Bank Street measures, some of which assess 
motivation and curiosity (Stern and Gordon, 1967; Cohen and Stern, 
1968). But it will require real skill to select components of current 
instruments and adapt them to the specific dimensions of classroom 
process that interest us. 

HOME AND NEIGHBORHOOD OBSERVATION INSTRUMENTS 

One area of measurement never attempted in a Head Start evaluation 
is naturalistic observation of the child in his neighborhood or home. 
Ethological or ecological assessment of Head Start's global effects 
would be valuable, especially if measures could be devised to explore 
questions about cognitive process and increased social competency and 
awareness. It is of obvious importance to policymakers that we measure 
the child's actual conduct in the world outside the Head Start center. 
Measures of behavior outside class have a built-in external validity 
other instruments cannot claim. The OCD is now interested in global 
evaluation and may in the future want to emphasize Head Start's 
effects on the family in the fuller context of neighborhood and home 
(Bronfenbrenner, in press). 

In general, home observation strikes me as a high-risk high-gain 
venture. We know little about how to do it for Head Start children 
or about which hypotheses to explore. (Does Head Start give a child 
more poise in dealing with his mother?) But if effects could be 
demonstrated it would be powerful evidence of Head Start's value. 
Among the few good measures in this domain at present are those of 
White and Watts (1973), looking at parent-child interaction in the 
home, those of Watts (Watts et al., 1972), for infants and toddlers 
in daycare centers, and those of the Schoggens (1971). None of the 
measures are fully appropriate for use in Head Start, but they might 
be adapted. 

Neighborhood observation outside the home and the Head Start center 
is also tempting, but it raises even greater difficulties. Observers 
might follow children from place to place as they played, in the 
fashion of some of Piaget's earliest work or the work of Barker (1968) 
and other post-Lewinians. There are problems with such an approach, 
however, unless we could agree on indices of increased social awareness 
or competency and could do the assessment in structured settings. 
Two possibilities for structure come to mind. We might ask that 
children go with their mothers or other family members to certain 
neighborhood stores and services and then observe their interactions 
there. Alternatively, we might assign the child various tasks to 
perform in the neighborhood, reminiscent of a treasure hunt. If we 
want to know whether children can find the fire department or know how 
to talk to a policeman, we may be well advised simply to design a 
task that has them do this. Children might be given a list of things 
to do and then be observed while they did them, or assessed afterward 
according to whether they were able to do them. Control children 
might be given the same tasks. 

PARENT, TEACHER, AND SIBLING INTERVIEWS AND RATINGS 

These approaches generally fail on grounds not of reliability but 
of face validity. It is not terribly convincing to be told by parents 
that their child is now more competent than before, or to be told by 
a Head Start teacher that Head Start children are performing better 
than controls. Moreover, blind procedures that might enhance validity 
are clumsy and expensive. But here again there is a realm of imaginative, 
face-valid measures that might be considered. We might, for 
instance, collect data from kindergarten or first grade teachers on 
the placement of the last year's Head Start children. We might also 
interview parents about how children have changed in their preferred 
activities. Both of these approaches could be valuable, at least in 
the preliminary stages of developing observational instruments. They 
would help us gather information from teachers, parents, and other 
neighborhood people on best bets for specific areas of behavior to be 
assessed. 

INSTRUMENTS FOR COLLECTING INCIDENTAL FACTS 
ABOUT REDUCTIONS IN SOCIAL COSTS 

One category of instrument overlooked in the past that should be 
given attention under the rubric of cognitive effects measures and 
elsewhere in the evaluation is the catchall category of facts about 
[illegible] by Head Start. David Weikart (personal communication, 
July, 1973), for instance, has had success in promoting his 
program simply on the grounds that it results in fewer children being 
assigned to MR classes in school, with resulting cost reductions to 
the taxpayer. Weikart can make the case that early education, at 
least in his program, is cost-effective. Such effects probably would 
not show up as dramatically among children in field sites as among 
children in the Weikart lab school program, but there is no doubt that 
this "cognitive effect" criterion is important. 

Such a measure shifts the burden of proof for cost-effectiveness 
from Head Start to later programs, which will have to cope with un- 
treated problems in the event a child does not attend preschool. Head 
Start probably has a number of such benefits, resulting from screening 
procedures of various kinds and from the child's being more aware and 
better socialized than before. We might explore incidence of undiagnosed 
problems of sight and hearing in the year after Head Start, incidence 
of children's involvement with the juvenile courts, and incidence of 
problem behavior in the kindergarten or first grade. 

BALANCE AMONG TYPES OF INSTRUMENTS 

Among the various kinds of measures, it remains for us to decide 
an appropriate mix given what we know about currently available tests 
in each category, how much money we have to spend to develop and 
administer tests, and how much time we have to design new instruments. 
Money, timing, type of instrument, predicted levels of face validity 
and reliability of administration, all of these must be weighed simul- 
taneously before we can tell OCD "what it should want." These consider- 
ations cannot be sorted out fully here, but some generalizations can 
be made. 

First, we need to keep the cognitive-effects evaluation simple. 
The fewer instruments the better, if the purposes of the evaluation are 
well-served by the ones chosen. I believe in George Miller's magic 
seven plus or minus two, preferring to err to the minus side where 
bureaucrats and legislators are involved. Most consumers of Head Start 
evaluations simply are not able to digest more than five or so measures 
in any given domain and make sense out of results. If we could present 
our findings on five or six dimensions of cognitive effects, perhaps 
showing differential program effects as Lesser, Fifer, and Clark (1965) 
did in their profile analysis of patterns of mental ability, this would 
be interesting and comprehensible to a wide audience. Anything more 
complicated, involving lots of second- and third-order interactions and 
differential effects on multiple instruments, serves only to confuse 
everyone. 

Second, it is important to decide how much additional instrument 
development is necessary in the area of cognitive assessment. As I 
see it, we can assume one of two stances on the matter, one quick and 
expedient, the other slower and with potentially higher yield. These 
are extensions of the reliability-based and validity-based strategies 
mapped earlier. The first choice is to give much more centrality to 
cognitive measures and to opt for conservative, highly reliable, and 
politically compelling data. Planners forgo any attempt at new 
cognitive instrument development, accept some version of Position 4 
(that cognitive effects should be a moderator variable), and adopt a 
limited assortment of the best currently existing measures. Some will 
be the same as the ones in the PVHS battery: perhaps an IQ measure, 
in acknowledgment of its political currency, along with certain 
criterion-referenced measures such as the WRAT, the PSI, or some of 
the NYU booklets. Perhaps one or two other measures in the works 
can also be selected (e.g., ETS CIRCUS tests). 

If we adopt this approach, heavy emphasis must be placed on indi- 
vidually administered tests, to get highest possible reliability per 
dollar. No current instruments in other domains can match the indivi- 
dual tests in this regard. Classroom observation would have to be 
limited drastically, to explore only a few specific hypotheses about 
transfer effects of individually tested competencies to observed 
activities in the classroom. Cognitive process assessment would be 
limited to what could be learned from coding schemes for cognitive 
style on the individually administered tests. There would be no 
assessment of social competency or social awareness in the cognitive 
domain except what could be gleaned in the individual testing situation. 
Parent and teacher interviews would be downplayed. Certain face-valid 
facts of interest to cost analysts, of the sort mentioned in the previous 
section, would be collected. 

In general, cognitive effects assessment of this sort would serve 
the purpose of demonstrating that something reliable was happening 
in Head Start, something we knew about sponsored programs before but 
that needed to be shown more carefully. It would not involve any 
reconceptualization of cognitive effects measurement. There would be 
only three differences from past evaluations: the battery of cognitive 
measures would be more limited; it would reflect an emphasis on 
face-valid, criterion-referenced, short-term program effects; and it 
would be much more carefully administered than before. 

The second option is more to my liking. If planners could get 
the concession of more time from OCD, it would make sense to spend a 
year developing a new battery of tests, limited in number but designed 
more carefully with Head Start in mind. A battery developed with 
this much lead time could include individually administered measures 
but also other kinds of measures, striving for equal levels of reli- 
ability for all. Agreeing upon a set of behavioral objectives in the 
domains of school readiness, cognitive process, and social awareness 
and competency will take time. It will then take more time to devise 
instruments that measure these objectives to everyone's satisfaction. 
Teachers or other Head Start field personnel have to give their 
opinions about which objectives are most important. Then instruments 
must be developed or adapted for the subset of behavioral objectives 
chosen, and these measures pre-tested. This is not a process that 
can be accomplished in less than one year. A new battery evolved in 
this fashion might include two individually administered tests, two 
observation schemes (one in the classroom and one in the home), one 
wild card instrument assessing child competence in various tasks 
either around the neighborhood or in the kindergarten and first grade 
classroom with older children, or both, and perhaps an inventory of 
social cost-benefit indices. 

Regardless of which option is chosen, it is abundantly clear that 
we are not ready to launch another three-to-five year longitudinal 
study immediately. Under option one we would be doing little more 
than replicating certain aspects of PVHS, with better data but with 
no possibility of changing to better measures. Under option two, we 
need a period of test development and then a field trial before being 
ready for another major evaluation. 

MEASUREMENT STRATEGY AND EVALUATION DESIGN 

Design of the evaluation is closely related to choice of instruments. 
If the OCD selects a design comparing a more costly but presumably more 
effective sponsored program with traditional programs and non-Head 
Start controls, for instance, then the final layout will have three 
levels: a sponsored group, a traditional group, and a control group. 
Assuming for illustration a sample of 600 children, each treatment 
group would have 200 children. These might be all the children from 
a limited number of centers, or, if OCD were willing to pay more, could 
be a group chosen randomly from among the children in all centers of 
the appropriate treatment type. Notice that even with this simple 
three-level design and with no mention of other independent variables, 
the study would be down to two hundred children per level. 

If the OCD is interested in a good study, without hopelessly 
confounded variables preventing valid inference, then it needs to 
realize that as more and more independent variables, covariates, and 
other controls are introduced, cell sizes can diminish rapidly to 
nothing, or next to nothing, leaving us with the problem that characterized 
the PVHS design. This should not be permitted to happen. I will try 
to be more specific, to show concretely how hard it is to avoid the 
temptation of trying to investigate too much at once. 

Judging from all we have learned in past Head Start evaluations, 
the following variables are important to stratify on or control: 

geographic region 
urban/rural 
ethnicity 
SES 
IQ 
age (four or five, or broken out in monthly intervals) 
previous preschool 

No evaluation can gloss over these sources of differential effects. 
Not only must good data be collected but comparison groups must be 
well planned. Returning to the hypothetical estimate of 200 children 
per experimental level, let us make the further, not unreasonable, 
assumption that for reasons idiosyncratic to the study design, cell size 
cannot drop below 15 and still allow reliable estimates. Now evaluators 
probably could only consider three additional independent variables to 
cross the sample on, not the four or five we might like. Reviewing 
the candidate variables on our list, this could lead to difficulties. 
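
The arithmetic behind this limit can be sketched as follows. This is only an illustration under assumptions not stated in the text: each crossing variable is treated as having two categories (for example urban versus rural), children are assumed to spread evenly across cells, and the function names are hypothetical. Under those assumptions, 200 children per treatment level support three crossed variables (25 per cell) but not four (12.5 per cell).

```python
def cell_size(n_per_level: int, n_crossed: int, categories: int = 2) -> float:
    """Average children per cell after crossing n_crossed variables.

    Assumes every variable has the same number of categories and that
    children are spread evenly across cells (an idealization).
    """
    return n_per_level / (categories ** n_crossed)


def max_crossed_variables(n_per_level: int, min_cell: int, categories: int = 2) -> int:
    """Largest number of crossed variables that keeps cells at or above min_cell."""
    k = 0
    while cell_size(n_per_level, k + 1, categories) >= min_cell:
        k += 1
    return k


# Figures taken from the text: 600 children, three treatment levels,
# and a minimum workable cell size of 15.
total_children = 600
treatment_levels = 3
n_per_level = total_children // treatment_levels

for k in range(5):
    print(f"{k} crossed variables: {cell_size(n_per_level, k):.1f} children per cell")

print("maximum crossed variables:", max_crossed_variables(n_per_level, min_cell=15))
```

If any variable has more than two categories (several geographic regions, say), or if children are unevenly distributed, the cells empty out faster still, which only strengthens the point.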

SES and IQ would not be the problem. The range of SES in the 
Head Start population is greatly circumscribed and does not seem to 
explain much of the variance in outcomes in past evaluations. SES can 
be entered in most analyses as a moderator variable or covariate. 
Evaluators also can covary on IQ, unless the design is intended to look 
at special benefits for low-IQ children. There are large enough numbers 
in each treatment group to make sizable differences in group IQ means 
unlikely. Ethnicity is more of a problem. No evaluation can disregard 
it, but to make it a prominent independent variable does not seem 
advisable. In any event, most Head Start children are black. Probably 
the best solution is to make sure of a roughly comparable racial mix 
for each cell in the design, simply choosing children [at random] for 
inclusion in the study with the understanding that ethnicity will be 
controlled by initial stratification. 

Geographic region and urban/rural also have to be considered. They 
cannot be finessed, since programs in one part of the nation often are 
quite different from those in another, and since a program in the 
country usually differs from one in the city. We need representation 
in each of these areas for policy-relevant inferences about Head Start's 
nationwide effects. There is, however, the possibility of 
pooling of data in the event that no statistically significant 
differences are found between groups. 

Finally, the real bugaboos: age and previous preschool. There is 
no question that a test for a four year old is not the same as a test 
for a five year old, and maturation-related changes in children's 
thinking make a tremendous difference in the utility of given measures. 
Thus, for instance, the PSI is generally acknowledged as a test for four 
year olds, with definite ceiling effects for fives. To make matters 
more complicated, in the PVHS study it is clear that previous preschool 
made a difference in five year olds' performance on most of the 
individually administered tests: gains are not as great if a five year old 
is in his second Head Start year. There are a number of possible 
explanations for this phenomenon, ranging from the least important 
(increased familiarity with the pre-tester in a second year of the program) 
to the most important (reduced marginal utility of Head Start influence 
in a second year of the program), suggesting that we should have only 
first year children. 

In the past, some Head Start evaluations have established age by year 
as an independent variable, others have chosen age by month, others have 
introduced age by month as a covariate or dummy variable in regression 
equations. But no result is fully satisfactory if the test is 
measuring different things for children of different ages. Perhaps 
children should be chosen for the evaluation from an age range no wider 
than a year, or centers should be selected with children in a narrow 
age range. [illegible] concentrating on four year olds, skirting the 
problem of previous preschool as an additional variable, [illegible] 
regions, especially the South, where Head Start programs lead directly 
into first grade [illegible] because there are few kindergartens. This 
issue [illegible]. It is directly related to the [illegible] of relevant 
measures [illegible] that in the past has been overlooked, resulting in 
a nightmare for data analysts. Let us not be afraid to delimit the study 
somewhat if it will mean we can have more trust in what we find. 

It might also be interesting to reconsider the number of testings 
during the Head Start year, moving from the pre-post design of past 
evaluations to a time series design with three testings, or a design 
with true randomization of subjects and a single criterion testing 
in the spring. 

Evaluation planners should at least consider reducing the number 
of instruments in the battery, [illegible]. [illegible] instruments 
three times during the year would yield information about trends and 
about the times of year when gain is taking place. If it were 
discovered, for instance, that children were experiencing a large portion 
of their Head Start improvement during the interval from October to 
[illegible], improving only marginally thereafter, [illegible]. There 
are also problems with three testings, however. In the absence of 
alternative forms for most [illegible] the tests present a logistical 
problem of large proportions, and a potential irritant for Head Start 
center staff. [illegible] would have to begin almost as soon as the 
previous [one ended]. 
[illegible] criterion-referenced tests would be interesting if there 
could be true randomization [illegible] or an adequate approximation of 
it. Such a design might [illegible] full battery [illegible] post-test 
[illegible] instruments given at pre-[illegible]. [illegible] of this 
approach would be analogous to the National Assessment strategy, by 
which [illegible] given [illegible]. [illegible], however, it would 
[illegible] variables and the interactions of these [illegible]. 
In addition to the assumption that children [illegible] the same 
[illegible] variables from cell to cell, it would be necessary to assume 
that magnitude of gains can be compared from item to item. Also, items 
would have to be analyzed individually and not as part of scales, 
meaning that no independent validity or reliability estimates would be 
available based on [illegible] scale characteristics. 

An area that has never received adequate consideration in 
Head Start evaluations is the issue of decision rules for program 
[illegible]. Rules should be made explicit before the evaluation 
begins, so that it is clear that a gain of X amount on Y scale, or 
[illegible] areas B, C, and D, constitutes a sufficient demonstration 
[illegible]. This consideration takes us back to the beginning of 
[the report], where it was suggested that until there is [illegible] 
what it takes to convince appropriate audiences, [illegible]. Once such 
an understanding [illegible] based on it should be made explicit, 
[illegible] formally stated in the [illegible] various different 
[illegible] the OCD in advance to [illegible] anticipated and 
[illegible] in the past. But it is nonetheless important 
[illegible] interpretations. 

The Westinghouse-Ohio study tried to demonstrate systematic, 
sizable cognitive effects for a randomly selected group of Head Start 
centers, comparing children in these centers with non-Head Start 
controls. It found only slight effects when all programs in the sample 
were aggregated and mean gains were assessed. It would be a mistake 
for the next Head Start evaluation to recreate the Westinghouse-Ohio 
study, even with a better design and better measures of cognitive 
effects. Whatever the measures selected, effects probably will not 
be large enough in such a design to command respect from policymakers, 
even if sample size is large enough to give them a fair chance of 
being statistically significant. Substantively, much is obscured by 
analyzing program gain data only at the highest level of aggregation. 
It is not surprising that overall effects are slight; programs differ 
widely from one another and we know some are good and others bad. 
It is better to ask which programs are doing well and why. 

It is also important to shift the terms of the debate about 
[illegible] gains. This is not a fair criterion for Head Start evaluation. 
Longitudinal effects might be possible to demonstrate, at least into 
first and second grade (they have been demonstrated in smaller studies), 
but it would be unwise to spend the money necessary for a careful 
longitudinal evaluation design exploring this aspect of gain, 
especially when the magnitude of effects probably is not great. This 
is a criterion of program funding not imposed on other federal 
programs or other levels of schooling and should simply be rejected. 

The Planned Variation Head Start Study was an ambitious natural 
experiment that told us less than we had hoped. It was initiated 
before we were sure (a) we could implement various sponsored programs 
[illegible] they were [illegible], (b) we could control necessary 
variables in order to make valid inferences, (c) we had a battery of 
measures that could tell us about the differential effects of programs, 
and (d) these [illegible] would [illegible]. [If we learned] something 
from PVHS it is that bigger is not necessarily better, and that natural 
experiments in education, while they represent a significant 
improvement over [illegible] survey research, have problems of 
their own. The next evaluation, if it chooses to explore the 
interaction of program type and child group, should perhaps develop 
hypotheses from the PVHS data, but it should be much smaller and more 
carefully designed and executed. Randomization of children, clear 
definition of treatments, and adequate controls are needed. 

One strategy for evaluation planners in coming months would be 
reliability based. It would begin with a fixed budget and brief time 
frame as constraints: instruments would be selected from those 
currently available or nearly developed; sample size would be determined 
by dividing per-unit cost of reliable test administration into the 
[illegible] for test administration. This approach would [illegible] 
better data on a limited number of familiar measures. It would also be 
compatible with a decision to make cognitive effects measurement 
[less] important in this evaluation than previous studies, [illegible] 
cognitive instruments as moderator variables in a study that 
[illegible] analyzed effects outside the cognitive domain. Perhaps such 
a study would [focus] on the areas of health and nutrition, 
develop[illegible]. 

[illegible] validity-based strategy [illegible] time to devise a new 
Head Start [illegible] battery, [illegible] strategy may be possible in 
[illegible] with substantial efforts made to develop new [illegible] 
instruments. Test development could proceed in the [illegible], 
cognitive process, and social awareness and competency. Currently 
available instruments [illegible], but it is certain that a battery 
departing significantly from the present one will take [more] than a 
year to develop. The extra time would be worth taking if cognitive 
effects are again to play a central role in the evaluation. 

Regardless of which strategy planners adopt, one [illegible] 
evaluation is that the [illegible] be well administered. 
In past Head Start evaluations, the most basic aspects of test 
administration and data collection have gone wrong, resulting in data 
of dubious value. Under no circumstances should this error be repeated. 
It makes [more sense] to administer a few measures well than a lot of 
measures poorly. This is true even if it means to some extent 
sacrificing external validity resulting from a large, representative 
national sample and a large number of instruments. 